Recognizing facial emotions is key to social interaction, yet the subjective nature of emotion labeling poses challenges for automatic facial affect prediction. Variability in how individuals interpret emotions leads to uncertainty in the training data for machine learning models. While multiple raters and interrater agreement (IRA) measures are used to address this, the extent of their use and their impact on dataset reliability are not well understood. This systematic literature review investigates the methodologies used to measure IRA in facial affect recognition datasets. Concrete eligibility and feasibility criteria were applied, resulting in 47 papers retrieved from Scopus, Web of Science, IEEE Xplore, and the ACM Digital Library. Data on affect states, affect representation schemes (ARS), and the IRA methodologies used by the datasets and their corresponding papers were extracted to provide a comprehensive overview and enable a detailed analysis. No clear correlation was found between ARS and IRA, but the retrieved data showed that Fleiss' kappa has been the most popular methodology both historically and in recent years.
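For context, Fleiss' kappa quantifies agreement among a fixed number of raters assigning items to categorical labels, correcting the observed agreement for the agreement expected by chance. The sketch below follows the standard definition of the statistic; the function name and the example rating matrix are illustrative and do not come from any of the reviewed datasets.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a matrix of shape (n_items, n_categories),
    where counts[i, j] is the number of raters who assigned item i
    to category j. Assumes the same number of raters for every item."""
    n_raters = counts.sum(axis=1)[0]  # raters per item (constant by assumption)
    # Per-item observed agreement: proportion of agreeing rater pairs.
    p_i = np.sum(counts * (counts - 1), axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()  # mean observed agreement across items
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 images, 3 raters, 3 emotion categories.
ratings = np.array([
    [3, 0, 0],   # all raters agree on category 0
    [0, 2, 1],
    [1, 1, 1],   # maximal disagreement
    [0, 0, 3],
])
print(fleiss_kappa(ratings))  # ~0.36, i.e. fair agreement beyond chance
```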