Leveraging Feature Extraction to Detect Adversarial Examples
Let's Meet in the Middle
Abstract
Previous research has explored the detection of adversarial examples with dimensionality reduction and Out-of-Distribution (OOD) recognition. However, these approaches are not effective against white-box adversarial attacks. Moreover, recent OOD methods that rely on hidden units limit the scalability of the target model.
For this reason, we study various explanations of adversarial examples to better understand their properties and anomalies. Furthermore, we discuss the added value of natural scene statistics and utility functions in improving the relevance of the features used for detection. By exploiting the anomalies we identified for adversarial examples in an ensemble, this thesis is the first to propose a robust solution against adaptive, white-box attacks.
Specifically, we address these challenges with MeetSafe, a Gaussian Mixture Model that leverages principal component analysis, feature squeezing, and density estimation to detect adaptive white-box adversaries. In addition, our enhanced Local Reachability Density (LRD) algorithm improves the efficiency of state-of-the-art OOD methods: it enhances scalability by feature bagging hidden units with large absolute Z-scores. We then show that predictors, including LRD, are far more effective in ensembles like MeetSafe, supporting prior conjectures that combining a range of different heuristics may further constrain adversaries.
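To make the scalability idea concrete, the sketch below illustrates the two ingredients named above under simplifying assumptions: selecting a "bag" of hidden units whose activations have the largest absolute Z-scores on clean data, then scoring a query with a plain LOF-style local reachability density over that reduced feature set. The function names and the choice of top-k selection are illustrative, not the thesis's exact implementation.

```python
import numpy as np

def zscore_feature_bag(train_acts, k):
    """Pick the k hidden units with the largest mean absolute Z-score.

    train_acts: (n_samples, n_units) hidden activations on clean data.
    Returns the column indices of the selected units (illustrative heuristic).
    """
    mu = train_acts.mean(axis=0)
    sigma = train_acts.std(axis=0) + 1e-8          # avoid division by zero
    z = np.abs((train_acts - mu) / sigma)          # per-sample |Z-scores|
    return np.argsort(z.mean(axis=0))[-k:]         # top-k units by average |Z|

def lrd(x, ref, k=5):
    """LOF-style local reachability density of x w.r.t. reference points.

    Higher values mean x sits in a dense region of clean activations;
    adversarial inputs are expected to score lower.
    """
    d = np.linalg.norm(ref - x, axis=1)            # distances to all refs
    idx = np.argsort(d)[:k]                        # k nearest neighbours
    # k-distance of each neighbour within the reference set
    # (index k skips the zero self-distance)
    kdist = np.array(
        [np.sort(np.linalg.norm(ref - ref[i], axis=1))[k] for i in idx]
    )
    reach = np.maximum(d[idx], kdist)              # reachability distances
    return 1.0 / (reach.mean() + 1e-12)
```

In practice, restricting LRD to a small Z-score-selected bag of units keeps the nearest-neighbour computation tractable as the target model grows, which is the scalability benefit the abstract refers to.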
Extensive experiments on 14 models show that MeetSafe detects adaptive perturbations with an accuracy of 62% on STL-10, 75% on CIFAR-10, and 99% on MNIST using either adversarial training or Reverse Cross Entropy (RCE), improving on each evaluated method by at least 8.1% on average across the three datasets.