Are Neural Networks Robust to Gradient-Based Adversaries Also More Explainable? Evidence from Counterfactuals
Abstract
Adversarial training has emerged as the most reliable technique for making neural networks robust to gradient-based adversarial perturbations of input data. Beyond improving robustness, preliminary evidence points to an interesting consequence of adversarial training: increased explainability of model behaviour. Prior work has explored the effects of adversarial training on gradient stability and interpretability, as well as the visual explainability of counterfactuals. Our work presents the first quantitative, empirical analysis of the impact of model robustness on model explainability, comparing the plausibility of faithful counterfactuals for robust and standard networks. We seek to determine whether robust networks learn more plausible decision boundaries and representations of the data than regular models, and whether the strength of the adversary used to train robust models affects their explainability. Our findings indicate that robust networks for image data learn more explainable decision boundaries and data representations than regular models, with more robust models producing more plausible counterfactuals. Robust models for tabular data, however, conclusively exhibit this effect only along decision boundaries and not in their overall representations of the data, possibly because of the high robustness-accuracy trade-off on tabular data and the difficulties that its innate properties pose for traditional adversarial training. We believe our work can help guide future research towards improving the robustness of machine learning models while keeping their explainability in mind.
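The abstract does not specify the training setup; purely as background on the gradient-based robustification it refers to, a minimal PyTorch-style sketch of standard PGD adversarial training (in the style of Madry et al., not the authors' exact method) might look like the following. The perturbation budget eps, step size alpha, and step count are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Craft L-infinity bounded adversarial examples with projected gradient descent."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the eps-ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()


def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
    """One mini-batch update on adversarial examples instead of clean inputs."""
    model.eval()
    x_adv = pgd_attack(model, x, y, eps=eps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Varying eps (the strength of the adversary) in such a loop is the kind of knob the abstract alludes to when asking whether stronger adversaries yield more explainable models.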