Emotional datasets for automatic affect prediction usually employ raters to annotate emotions or to verify the annotations. To ensure the reliability of these raters, some corpora use interrater agreement measures to verify the degree to which annotators agree with each other on what they rate. This systematic review explores which interrater agreement measures are used in emotional speech corpora. The affective states, the affect representation schemes, and the collection methods of the datasets, as well as the popularity of these measures, were investigated. Scopus, IEEE Xplore, Web of Science, and the ACM Digital Library were searched for papers that describe the creation of datasets, and 45 papers were included in the review. The review concludes that the interrater agreement measures used are highly dependent on the speech collection method and the affect representation scheme. No standardized way of measuring interrater agreement was found. Datasets that use actors to record emulated emotions mostly use recognition rate as their interrater agreement measure. Datasets that use a dimensional representation scheme often compute the mean agreement of the raters and the standard deviation of that measure to check interrater agreement. Datasets that neither use actors nor a dimensional representation use a plethora of different measures, such as probabilistic computations of agreement or majority agreement measures, but a large number use no measure at all.
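As a minimal illustrative sketch (not taken from any of the reviewed papers), the two most common measures mentioned above could be computed as follows, assuming a hypothetical data layout in which categorical judgements are stored as label lists and dimensional ratings as a stimuli-by-raters array; all function and variable names are illustrative.

```python
import numpy as np

def recognition_rate(intended_labels, rated_labels):
    """Fraction of rater judgements that match the emotion the actor intended."""
    intended = np.asarray(intended_labels)
    rated = np.asarray(rated_labels)
    return float(np.mean(intended == rated))

def dimensional_agreement(ratings):
    """Per-stimulus mean and average standard deviation of dimensional ratings.

    `ratings` is assumed to be a (n_stimuli, n_raters) array, e.g. arousal
    scores; a lower average standard deviation indicates higher agreement.
    """
    ratings = np.asarray(ratings, dtype=float)
    per_stimulus_mean = ratings.mean(axis=1)
    per_stimulus_std = ratings.std(axis=1, ddof=1)
    return per_stimulus_mean, float(per_stimulus_std.mean())

# Toy example: three utterances judged by raters
print(recognition_rate(["anger", "joy", "anger"], ["anger", "sadness", "anger"]))
arousal = [[3, 4, 3], [1, 2, 1], [5, 5, 4]]
means, avg_std = dimensional_agreement(arousal)
print(means, avg_std)
```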