Survey of Affect Representation Schemes used in Automatic Affect Prediction for Speech Emotion Recognition: A Systematic Review
Abstract
Automatic affect prediction systems usually assume an underlying affect representation scheme (ARS). This systematic review explores how different ARS are used in affect prediction systems based on spoken input, focusing only on the audio input from speakers. Various speech emotion recognition datasets were also included in the study to understand the motivation for the categorical or dimensional schemes used to represent emotions. We investigated the basis, popularity, advantages and target affective states of these schemes. Papers were extracted from Scopus and Web of Science, limited to English-language publications on systems in the field of Computer Science. In summary, our exploration of affect representation schemes in Speech Emotion Recognition (SER) reveals a predominant focus on categorical representations of affect, particularly variations of Ekman's six basic emotions. Behavior and attitude are also represented, though rarely. Emotions such as anger, happiness, and sadness receive the most attention, while treating the neutral state as an emotional state remains controversial. Dimensional affect representation schemes are less common, possibly because valence is difficult to estimate from audio input alone. Researchers often combine multiple categorical schemes to accommodate the different datasets used in SER systems, so the popularity of each scheme closely follows that of the datasets that use it. However, issues such as a lack of explanation for the chosen categories, interchangeable use of terminology, and a weak psychological foundation for category selection hinder a comprehensive understanding of affect representation in SER research.