Multiactivity analysis investigates how people coordinate concurrent actions, such as gesture and speech, within a social context, typically using video recordings of the interaction, to better understand the organisation of human behaviour. This paper focuses specifically on the coordination between speaking and drinking within a social setting, and explores the possibility of automatically identifying these events using audio captured from a drinking glass. As social interactions occur in vastly different contexts, the paper also investigates the effect that background noise has on identification accuracy. Different sample lengths and audio features were compared. Linear classification models, logistic regression (LR) and a support vector machine (SVM) with a linear kernel, achieved 100% accuracy for all sample lengths between 2 and 8 seconds using the first 20 principal components (PCA) of 60 audio features. The best-performing feature for identifying speaking and drinking events was mel-frequency cepstral coefficients (MFCCs), achieving an average F1 score of 99.4% across models with a training sample length of 3 seconds. The effect of background noise on classification accuracy depended on its type: with MFCC features and a 3-second sample length, music lowered the F1 score to 74.3%, noisy-room audio to 64.7%, and podcast audio (simulating the presence of other speakers) to 59.6%.
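The classification pipeline described above (60 audio features reduced to 20 PCA components, then a linear classifier) can be sketched as follows. This is an illustrative sketch, not the authors' code: since the glass-audio dataset is not available here, per-clip feature vectors (e.g. MFCC summary statistics) are replaced by synthetic, well-separated Gaussian clusters, and all names are assumptions.

```python
# Sketch of the PCA + linear-classifier pipeline from the abstract.
# Synthetic data stands in for the 60 per-clip audio features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-clip feature vectors: each class
# ("speaking" vs "drinking") is drawn around its own centre in a
# 60-dimensional feature space.
n_per_class, n_features = 100, 60
speaking = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
drinking = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, n_features))
X = np.vstack([speaking, drinking])
y = np.array([0] * n_per_class + [1] * n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The two linear models named in the abstract: LR and a linear-kernel SVM,
# each fed the first 20 principal components of the scaled features.
for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("linear SVM", SVC(kernel="linear"))]:
    model = make_pipeline(StandardScaler(), PCA(n_components=20), clf)
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", model.score(X_te, y_te))
```

On this synthetic, clearly separable data both models classify perfectly; on the real glass-audio features the abstract reports the same 100% accuracy for 2-8 second samples.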