Recent studies have shown that gesture annotation schemes should account for the multidimensional nature of gestures and define their meaning in terms of referentiality and pragmatic meaning. However, accurately annotating gesture meaning in densely crowded social settings using such a coding scheme remains to be accomplished. This study uses the MultiModal MultiDimensional (M3D) labelling scheme and the EUDICO Linguistic Annotator (ELAN) tool to annotate video data from the Conference Living Lab (ConfLab) dataset. The ConfLab dataset contains 8 video recordings of standing conversations at a conference, captured from an overhead perspective, together with low-frequency audio recordings of the conversations. In total, 1119 clips of individual gesture instances are generated. These clips are then fed into a VideoMAE model pre-trained on the UCF101 dataset. The model achieves an overall accuracy of 49% on the test set but shows a strong bias towards one class because of the imbalanced dataset. Owing to the small size of the dataset and the similarities between gestures with different meanings, the model fails to distinguish between gesture types. The results demonstrate that high-frequency audio or transcripts of the conversations are vital to avoid strong and potentially incorrect assumptions when annotating gesture meaning. Further investigation is required into the annotation and classification of pragmatic meanings and into machine learning solutions for multi-class, multi-label video classification problems.
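To make the classification step concrete, the sketch below shows how a single gesture clip could be passed through a VideoMAE checkpoint using the HuggingFace transformers library. It is a minimal illustration only, not the configuration used in this study: the checkpoint name, gesture labels, and clip shape are assumptions.

```python
# Minimal sketch (not this study's code): scoring one gesture clip with VideoMAE
# via HuggingFace transformers. Checkpoint, labels, and clip are illustrative.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

labels = ["referential", "non-referential"]  # hypothetical gesture classes
checkpoint = "MCG-NJU/videomae-base"         # swap in a UCF101-pre-trained checkpoint

processor = VideoMAEImageProcessor.from_pretrained(checkpoint)
model = VideoMAEForVideoClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    ignore_mismatched_sizes=True,  # allow a new classification head sized to our labels
)

# One clip: 16 RGB frames, channel-first, 224x224 (random stand-in for a real clip).
clip = list(np.random.randn(16, 3, 224, 224))
inputs = processor(clip, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(-1).item()])
```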