Correspondence Between Perplexity Scores and Human Evaluation of Generated TV-Show Scripts

Abstract

In recent years, many new text generation models have been developed, yet evaluating generated text remains a considerable challenge. Currently, the only evaluation method that fully captures the quality of a generated text is human evaluation, which is expensive and time consuming. One of the most widely used intrinsic evaluation metrics is perplexity. This paper investigates the correspondence between perplexity scores and human evaluation of scripts for the TV show Friends generated with OpenAI's GPT-2 model. A survey was conducted with 226 participants, who rated selected scripts on creativity, realism, and coherence. The results show that generations with a perplexity close to that of an actual Friends script score best on creativity but low on realism and coherence. The most realistic and coherent generations were those with a lower perplexity, while the generations with the highest perplexity scored worst on all three criteria. These findings indicate that perplexity is not an adequate measure of the quality of generated TV-show scripts.
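For context, perplexity is the exponentiated average negative log-likelihood a language model assigns to a text: lower values mean the model finds the text more predictable. The sketch below is not the paper's code; it is a minimal example, assuming the Hugging Face Transformers library and an illustrative script snippet, of how such a score could be computed for a generated script with GPT-2.

```python
# Minimal sketch (not the authors' implementation) of scoring a script
# with GPT-2 perplexity using Hugging Face Transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Score the text with the model; the returned loss is the average
    # negative log-likelihood per token, so exp(loss) is the perplexity.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Hypothetical example line; any generated script text could be scored this way.
print(perplexity("Joey: How you doin'?\nChandler: Could I BE any more tired?"))
```

In this setup, a generated script that GPT-2 itself finds predictable yields a low perplexity, which is exactly the property whose link to human judgments of creativity, realism, and coherence the paper examines.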


