Deepfake Detection Using Convolutional Neural Networks

Working Towards Understanding the Effects of Design Choices

Abstract

When building a convolutional neural network, many design choices have to be made. In the case of Deepfake detection, there is no readily implementable recipe that guides these choices. This research works towards understanding the effects of these design choices, using the Python library Keras and publicly available datasets. The choices analysed are dataset composition, preprocessing, dropout rate, batch size, network architecture, and the specific dataset used. We also analyse the difference between training a network on original images and on DCT-residuals of images, and we analyse the networks' generalisation and robustness capabilities. The goal of these experiments is to work towards a readily implementable recipe for Deepfake detection algorithms. Furthermore, this research provides an overview of image manipulation algorithms, an overview of recent research into convolutional networks, and an extensive overview of the Deepfake detection research field.

To analyse dataset composition, we used subsets of FaceForensics++ with different numbers of frames per video. We trained a shallow network, containing only four convolutional layers, on all three subsets. The subset with one frame per video was the only one that did not result in immediate overfitting, although it contained fewer than two thousand frames in total. We continued with this small subset and tested different preprocessing settings and dropout rates on our shallow network. We found that preprocessing and dropout could not increase the maximum achievable accuracy, although they did curtail overfitting. It is possible that accuracy did not increase due to the high variety of artefacts in FaceForensics++. Batch size likewise had no effect on the maximum achievable accuracy, although the runtime required for training a network increased considerably as the batch size decreased.

We tested DenseNet-121, Inception-v3, ResNet-152, VGG16, VGG19, Xception, and our shallow network on three datasets: FaceForensics++, Celeb-DF, and DeeperForensics-1.0. The goal of this experiment was to determine which type of network is best suited for detecting Deepfakes with publicly available datasets of small size but high variation, and whether performance differed across the datasets. Although all networks except the shallow one were pretrained on ImageNet, three of the networks tested immediately overfitted on all three datasets: DenseNet-121, Inception-v3, and Xception. The remaining networks had the most difficulty with Celeb-DF, on which none of them managed to reach 70% accuracy before overfitting. The easiest dataset to train on was DeeperForensics-1.0, on which our shallow network achieved 92.5% accuracy. However, when testing the networks' robustness, none of the networks trained on DeeperForensics-1.0 reached an accuracy higher than chance on Celeb-DF or FaceForensics++.

Shallow networks might therefore be better suited for Deepfake detection on our small dataset: the shallow model and VGG16 achieved the highest accuracies, and VGG19's performance was close to VGG16's. However, ResNet-152, the deepest network used in this research, performed better than the shallower DenseNet-121, Inception-v3, and Xception.

Lastly, training our networks on DCT-residuals was intended to help the networks focus on statistical rather than semantic image content. However, performance on DCT-residuals was at best similar to performance on original images. Our suggestions for continuing this research are to experiment with different input sizes and other types of residual filters, to collect larger and higher-quality datasets with high variety, and to use inter-frame detection.
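The sketches below illustrate, under stated assumptions, some of the components discussed above. To make the dataset-composition experiment concrete, the first sketch shows one way to extract a single frame per video, as in the smallest FaceForensics++ subset. This is a minimal illustration with OpenCV; the thesis does not prescribe this extraction code, and the function name is ours.

    import cv2

    def extract_first_frame(video_path):
        # Read only the first frame of a video, yielding a
        # one-frame-per-video dataset as in the smallest subset.
        capture = cv2.VideoCapture(video_path)
        success, frame = capture.read()
        capture.release()
        return frame if success else None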
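Next, a minimal Keras sketch of a shallow detector with four convolutional layers, in the spirit of the shallow network described above. The filter counts, kernel sizes, input resolution, and dropout rate are assumptions for illustration, not the exact configuration used in the thesis.

    from tensorflow.keras import layers, models

    def build_shallow_detector(input_shape=(128, 128, 3), dropout_rate=0.5):
        # Four convolutional layers, pooling in between, and a dropout
        # layer before the binary real-vs-fake output.
        model = models.Sequential([
            layers.Input(shape=input_shape),
            layers.Conv2D(16, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(),
            layers.Conv2D(32, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(),
            layers.Conv2D(64, 3, activation='relu', padding='same'),
            layers.MaxPooling2D(),
            layers.Conv2D(128, 3, activation='relu', padding='same'),
            layers.GlobalAveragePooling2D(),
            layers.Dropout(dropout_rate),
            layers.Dense(1, activation='sigmoid'),
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy',
                      metrics=['accuracy'])
        return model

Batch size is then simply an argument to training, e.g. model.fit(x_train, y_train, batch_size=32); as noted above, it mainly affects runtime rather than the maximum achievable accuracy.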
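The deeper networks were taken pretrained on ImageNet. The following sketch shows a common Keras pattern for that, using VGG16 as an example; the classification head and the choice to freeze the backbone are assumptions, not necessarily the configuration used in the thesis. The same pattern applies to ResNet-152, DenseNet-121, Inception-v3, and Xception via keras.applications.

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    # ImageNet-pretrained convolutional backbone, without the original
    # 1000-class head.
    base = VGG16(weights='imagenet', include_top=False,
                 input_shape=(224, 224, 3))
    base.trainable = False  # keep the pretrained features fixed

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid'),  # real vs. fake
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])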
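Finally, for the DCT-residual experiment, one plausible construction is to transform an image to the frequency domain, suppress the low-frequency coefficients that carry most of the semantic content, and transform back, leaving mostly high-frequency statistics. The cutoff below is an assumed parameter, and the exact residual definition in the thesis may differ.

    import numpy as np
    from scipy.fft import dctn, idctn

    def dct_residual(gray_image, cutoff=8):
        # 2-D DCT of the image; low frequencies sit in the top-left corner.
        coefficients = dctn(gray_image.astype(np.float64), norm='ortho')
        # Zero the lowest-frequency block, keeping high-frequency content.
        coefficients[:cutoff, :cutoff] = 0.0
        return idctn(coefficients, norm='ortho')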
