Computers that can estimate a person's intention to speak can improve human-computer interaction. Next-speaker prediction has been studied extensively, but it differs from estimating intentions to speak, since an intention to speak depends only on the person themselves. Previous research inferred intentions to speak from accelerometer data, with some useful results. This paper expands on that research by adding non-verbal vocal behaviour as an additional modality, making the model multimodal. The model is trained on successful intentions to speak and tested on both successful and unsuccessful intentions to speak. Part of the dataset was annotated for unsuccessful intentions to speak, and the signals in these annotations were analyzed. The results indicate that non-verbal vocal behaviour is a much more reliable indicator of successful intentions to speak than accelerometer data. Combining both modalities improves the score slightly, but not significantly. Training on unsuccessful intentions to speak is likely needed to estimate them reliably. Additional modalities could be investigated to possibly improve the model further.