Towards the Automatic Assessment of Social Experience in in-the-wild Mingling Settings

Abstract

Endowing machines with social competence is not only a science fiction theme. It is also a long-held goal in computer science. Machines have changed how we work, communicate, and practice art, science, and engineering, but they have had little effect on one of our core human needs: social interaction. Although digital communication has changed the way we interact with others, machines have arguably done little to enhance the quality of our face-to-face interactions and are seldom seen as tools to help us improve the way we interact with others. This is in part due to their lack of social competence thus far.

A crucial stepping stone towards social competence and the ability to display empathy is the ability to assess social experience. Social experience refers to internal states reflecting an individual's perception of a social situation, like enjoying a conversation or feeling attracted to someone they are interacting with. Social experience variables are hard to study because they are not directly observable and change over time, so researchers must rely on self-reports or third-party assessments (annotations). Algorithms for the assessment of social experience generally take one of two approaches: 1) direct modeling of the relationship between raw or derived signals and experience variables, using sensor readings or the outputs of detectors and feature extractors; and 2) intermediate modeling or detection of discrete actions performed during interactions (e.g., speaking, laughter, gesturing).

In this dissertation, we focus on in-the-wild mingling settings, where subjects are standing and are free to form and switch conversation groups as they desire. We pay special attention to data collection and annotation, given their relevance in a nascent field and the nuance involved in collecting and annotating social signals. Because the goal is to study machine social perception in real-life settings, interactions are not scripted and instrumentation is kept to a minimum.

We start with work on the direct assessment of social experience, in this case attraction, by exploring the predictive power of body acceleration. By analyzing accelerometer data from speed dating interactions, we investigate how the intensity of and variations in body movement relate to self-reported attraction levels. This study sheds light on how well synchrony, mimicry, and convergence estimates predict attraction, and potentially other constructs related to affiliation.

We then address the detection of speaking, an action of wide interest in social signal processing due to the relevance of turn-taking in social experience. In particular, we tackle the limitations posed by visual cross-contamination in crowded mingling settings. We introduce a model that employs accelerometer readings and body poses to enhance the robustness of speaking status detection in complex scenes with multiple interactions occurring simultaneously.

The dissertation also presents two novel datasets, ConfLab and REWIND, each serving a distinct purpose. ConfLab, collected during a conference, is notable for its annotations of body joints and for improvements to the sensor setup that result in increased data fidelity. Such methodological contributions, which enable efficient and high-quality data collection, are increasingly valuable given the scarcity of social interaction datasets, particularly in mingling settings. REWIND, gathered at a business networking event, stands out for its high-quality individual audio recordings, which are useful for the cross-modal study of multimodal signals such as speaking or laughter.

Along similar lines, we present the Covfee software framework. Covfee challenges existing annotation methodologies by introducing and studying interfaces for the continuous annotation of keypoints and actions. By streamlining the annotation process, this framework was instrumental in efficiently processing the vast amounts of data collected in studies like ConfLab.

Also building on the Covfee framework, the dissertation culminates in an exploration of laughter annotation across different modalities. By comparing laughter annotations acquired in different conditions, the research highlights the complexities and nuances involved in interpreting social signals across different sensory inputs. We challenge the assumption that laughter intensity should be considered a property of the laughter episode. Instead, we find evidence that laughter evaluations differ significantly depending on the modalities available to the observer and that modalities with higher agreement will not necessarily result in the highest model performance. These results not only contribute to the study of laughter detection but also provide valuable insights for future research on multimodal social signal processing.

In summary, this dissertation weaves together a series of methodological contributions and novel findings, often derived from these new methods, each furthering our understanding of how best to train machines for social understanding and competence.

Files

16497_Completed.pdf
(pdf | 28.8 Mb)
- Embargo expired in 07-10-2024
Unknown license