With the growing number of human-computer interactions, the need for systems that can accurately infer and respond to users' emotions becomes increasingly important. One way to achieve this is by examining audio-visual signals, aiming to identify the underlying emotions from an individual's gestures, auditory cues, and surroundings. Such automatic affect prediction systems depend heavily on labeled datasets. However, the subjective nature of emotion interpretation often introduces uncertainty, making it challenging to create reliable and high-quality datasets. To address this issue, researchers have employed multiple raters to judge the affective state of a person, while using interrater agreement measures to monitor uncertainty. To date, it remains unclear to what extent reaching a good level of interrater agreement impacts the performance of audio-visual Automatic Affect Prediction models.
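Interrater agreement is typically quantified with chance-corrected statistics such as Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha. The following minimal sketch, assuming two hypothetical raters and scikit-learn's cohen_kappa_score, illustrates how such a measure can be computed; the labels and library choice are illustrative and not tied to any particular database discussed in this review.

from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical emotion labels assigned by two raters
# to the same six audio-visual clips (illustrative data only).
rater_a = ["happy", "sad", "angry", "happy", "neutral", "sad"]
rater_b = ["happy", "sad", "happy", "happy", "neutral", "angry"]

# Cohen's kappa corrects raw percent agreement for agreement expected by chance;
# values near 1 indicate strong agreement, values near 0 indicate chance level.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

Such measures allow dataset creators to flag samples or raters whose labels diverge substantially, which is one of the annotation practices examined in the survey below.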
As a first step towards understanding these potential influences on performance, this paper conducts a systematic literature review investigating the annotation procedures used in audio-visual databases. Relevant literature was retrieved from four scientific databases: Scopus, IEEE Xplore, Web of Science, and ACM Digital Library. The results are aggregated from 55 papers and reported following the PRISMA guidelines. They indicate that most databases use multiple annotators, a little more than half measure interrater agreement, and most train the raters to increase the uniformity of the labels.