Expanding Merlin-Arthur Classifiers: Interpretable Neural Networks Through Interactive Proof Systems ongoing
Existing approaches for interpreting Neural Network classifiers that highlight features relevant for a decision are based solely on heuristics. We introduce a theory that allows us to bound the quality of the features without assumptions on the classifier model by relating classification to Interactive Proof Systems.
🧑‍🎓 Project Members
🪙 Funding
This project is being funded by the Berlin Mathematics Research Center MATH+ (project ID EF1-24), itself funded by the German Research Foundation (DFG) under Germany's Excellence Strategy (EXC-2046/1, project ID 390685689) from April 2023 to March 2026.
🔬 Project Description
Safe deployment of Neural Network (NN) based AI systems in high-stakes applications, e.g., medical diagnostics or autonomous vehicles, requires that their reasoning be subject to human scrutiny. The field of Explainable AI (XAI) has thus put forth a number of interpretability approaches, among them saliency maps, mechanistic interpretability, and self-interpreting networks. These methods operate without any theoretical guarantees, and it has been demonstrated that numerous XAI methods can be manipulated by a clever design of the NN. We introduce a novel framework that guarantees feature-based interpretability by linking classification to Interactive Proof Systems. We then extend this framework to real-world applications and challenges, both practically and theoretically. To tackle real-world problems, we intend to use successful models such as neural networks, or optimization algorithms such as Frank-Wolfe, as provers and verifiers. This is possible because our approach does not depend on the specific structure of the agents involved.
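As a rough illustration of what it means for the approach not to depend on the structure of the agents, the toy sketch below treats the prover and the verifier as plain functions that exchange a single feature. Everything in it is an assumption made for illustration, not the project's code: the names `merlin`, `arthur`, `gullible_arthur` and `make_morgana`, the toy data, and the simplified completeness and soundness scores. Completeness here measures how often the verifier classifies correctly from a cooperative prover's feature; soundness measures how often an adversarial prover fails to push the verifier to a wrong label.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical toy setup: inputs are binary vectors and the label is x[0].
Feature = Tuple[int, int]                      # (index, value) of one revealed coordinate
Prover = Callable[[Sequence[int]], Feature]
Verifier = Callable[[Feature], Optional[int]]  # None means "I don't know"


def true_label(x: Sequence[int]) -> int:
    return x[0]


def merlin(x: Sequence[int]) -> Feature:
    """Cooperative prover: reveals the coordinate that determines the label."""
    return (0, x[0])


def arthur(feature: Feature) -> Optional[int]:
    """Careful verifier: only trusts coordinate 0, otherwise abstains."""
    index, value = feature
    return value if index == 0 else None


def gullible_arthur(feature: Feature) -> Optional[int]:
    """Careless verifier: outputs the revealed value whatever its index."""
    return feature[1]


def make_morgana(verifier: Verifier) -> Prover:
    """Adversarial prover: searches for a genuine feature of x that fools the verifier."""
    def morgana(x: Sequence[int]) -> Feature:
        for i, v in enumerate(x):
            out = verifier((i, v))
            if out is not None and out != true_label(x):
                return (i, v)
        return (0, x[0])  # no fooling feature exists for this input
    return morgana


def completeness(prover: Prover, verifier: Verifier, data) -> float:
    """Fraction of inputs classified correctly from the prover's feature."""
    return sum(verifier(prover(x)) == true_label(x) for x in data) / len(data)


def soundness(adversary: Prover, verifier: Verifier, data) -> float:
    """Fraction of inputs on which the verifier is correct or abstains."""
    return sum(verifier(adversary(x)) in (true_label(x), None) for x in data) / len(data)


data = list(product([0, 1], repeat=3))
print(completeness(merlin, arthur, data))                             # 1.0
print(soundness(make_morgana(arthur), arthur, data))                  # 1.0
print(soundness(make_morgana(gullible_arthur), gullible_arthur, data))  # 0.25
```

Because the scores only call the agents as functions, any of them could in principle be swapped for a neural network or an optimization routine without changing how the scores are computed.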
💬 Talks and posters
Conference and workshop talks
- Jul 2024
- Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks by Berkant Turan
Workshop on Theoretical Foundations of Foundation Models (TF2M) @ ICML 2024, Vienna
- Jul 2023
- Extending Merlin-Arthur Classifiers for Improved Interpretability by Berkant Turan
The 1st World Conference on EXplainable Artificial Intelligence
📝 Publications and preprints
- Wäldchen, S., Sharma, K., Turan, B., Zimmer, M., and Pokutta, S. (2024). Interpretability Guarantees with Merlin-Arthur Classifiers. Proceedings of the International Conference on Artificial Intelligence and Statistics.
[arXiv]
[BibTeX]
- Głuch, G., Turan, B., Nagarajan, S. G., and Pokutta, S. (2024). The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses.
[arXiv]
[BibTeX]
- Sadiku, S., Wagner, M., and Pokutta, S. (2023). Group-wise Sparse and Explainable Adversarial Attacks.
[arXiv]
[BibTeX]