Multimodal information extraction from videos

Automatic creation of highlight clips from political speeches

Abstract

With the huge amount of data that is collected every day and shared on the internet, many recent studies have focused on methods to make multimedia browsing simple and efficient, investigating techniques for automatic multimedia analysis. This work specifically delves into the case of information extraction from videos, which is still an open challenge due to the combination of their semantic complexity and dynamic nature. The majority of the existing solutions are tailored to specific video categories and result in the creation of key-frame time-lapses, video summaries, video overviews or highlight clips. In particular, this thesis project focuses on the case of highlight extraction from videos in which one person speaks facing the camera. Automating the analysis of this specific kind of video is important in the industrial context because it can be harnessed for several interesting applications, such as the automatic summarisation of interview videos or the automatic creation of personal video curricula vitae.

In this setting, the research objective is to investigate how Machine Learning can be deployed for the task of information extraction. From the target videos, multiple types of features can be extracted: textual features from the speech transcription; visual features from the facial expressions, head pose, eye gaze and hand gestures; and audio features from the variations in the tone of the voice. Exploiting multimodal features enhances the capabilities of Machine Learning algorithms. In fact, as shown in previous research, integrating multiple channels of information (textual, audio and visual) makes it possible to derive more precise and richer knowledge, just as humans exploit their multiple senses, in addition to experience, to make classifications or predictions. In this work, two approaches for multimodal information extraction from videos are investigated. The first approach is based on the simple concatenation of multimodal feature vectors, while the second approach exploits a recent deep architecture, the Memory Fusion Network by Zadeh et al., to model both individual and combined temporal dynamics. To test the effectiveness of multimodal learning in the context of information extraction from videos, the two techniques are compared against a unimodal, content-based method that relies on the summarisation of the video transcripts.
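
To illustrate the first approach, the following is a minimal sketch of early fusion by feature-vector concatenation followed by a binary saliency classifier. The feature dimensions, the random placeholder data and the choice of logistic regression are illustrative assumptions and do not reflect the exact configuration used in the thesis.

    # Sketch: early fusion of per-segment multimodal features (assumed setup)
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_segments = 200

    # Hypothetical per-segment features for each modality
    text_feats   = rng.normal(size=(n_segments, 300))  # e.g. transcript embeddings
    visual_feats = rng.normal(size=(n_segments, 128))  # e.g. facial expression / head pose descriptors
    audio_feats  = rng.normal(size=(n_segments, 64))   # e.g. prosodic statistics of the voice

    # Early fusion: concatenate the modalities into one feature vector per segment
    fused = np.concatenate([text_feats, visual_feats, audio_feats], axis=1)

    # Binary saliency labels (1 = salient segment); random placeholders here
    labels = rng.integers(0, 2, size=n_segments)

    # Train a simple classifier on the fused representation and score segments
    clf = LogisticRegression(max_iter=1000).fit(fused, labels)
    saliency_scores = clf.predict_proba(fused)[:, 1]

In practice, the highest-scoring segments could then be stitched together to form a highlight clip; the Memory Fusion Network approach instead keeps the modalities separate and fuses their temporal dynamics inside the network.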

In order to train the multimodal approaches in a supervised fashion, a novel dataset of political speeches by well-known American politicians, the Political Speeches Dataset, was collected. The dataset is provided with binary saliency labels that identify the ground-truth salient video segments. Four types of highlight clips are generated for each speech and evaluated through crowdsourcing. The results show that the quality of the automatically created highlight clips is comparable to the ground truth in terms of informativeness and ability to generate interest. Moreover, they confirm that highlight clips generated with multimodal learning are more informative than the baseline.
