A Reliable Methodology to Collect Ground Truth Data of Image Aesthetic Appeal
Abstract
Recognizing what makes an image aesthetically pleasing is crucial to the effectiveness of many multimedia systems. Several works have attempted to build image aesthetic appeal predictors and have created their own sets of ground truth data for the purpose, either by using rated images from photo-sharing websites or by asking a pool of users to rate images in lab or crowdsourcing experiments. The literature has shown that the way these experiments are conducted can influence their results: a poor experimental setup can produce unreliable outcomes (i.e., highly imprecise aesthetic appeal measures). The question then arises whether the different choices made to collect ground truth aesthetic appeal data are appropriate. In this paper, we present a systematic study of how the experimental environment and the rating scale used to collect image aesthetic appeal ground truth data influence the reliability and repeatability of aesthetic appeal assessments. Our findings show that discrete and continuous scales with five-point absolute category rating (ACR) labels yield more reliable results, with the continuous scale being more reliable for abstract images. We also show that image aesthetic appeal assessments can be repeatable across different experimental environments (i.e., lab and crowdsourcing). Finally, we formulate concrete recommendations to guide the collection of large sets of ground truth data for training models of aesthetic appeal appreciation.
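As a rough illustration of the kind of analysis the abstract implies (not the authors' actual pipeline), the minimal sketch below shows two measures commonly used for subjective rating studies: the per-image standard deviation of opinion scores (SOS) as a reliability indicator, and the correlation between mean opinion scores (MOS) gathered in two environments as a repeatability check. All variable names, panel sizes, and data here are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical rating matrices: rows = observers, columns = images,
# values = scores on a 5-point ACR scale (1 = bad ... 5 = excellent).
rng = np.random.default_rng(0)
lab_ratings = rng.integers(1, 6, size=(24, 40))    # assumed lab panel
crowd_ratings = rng.integers(1, 6, size=(60, 40))  # assumed crowd panel

# Mean opinion score (MOS) per image, and standard deviation of
# opinion scores (SOS) as a simple reliability indicator:
# lower SOS means observers agree more closely on that image.
lab_mos = lab_ratings.mean(axis=0)
lab_sos = lab_ratings.std(axis=0, ddof=1)
crowd_mos = crowd_ratings.mean(axis=0)

# Repeatability across environments: correlation between the MOS
# vectors obtained in the lab and in the crowdsourcing experiment.
r, p = pearsonr(lab_mos, crowd_mos)
print(f"lab-vs-crowd MOS correlation: r={r:.3f} (p={p:.3g})")
print(f"mean per-image SOS (lab): {lab_sos.mean():.3f}")
```

With real ratings in place of the random placeholders, a high lab-vs-crowd MOS correlation would indicate repeatable assessments across environments, while systematically lower SOS for one rating scale would indicate that scale yields more reliable scores.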