Expanding Merlin-Arthur Classifiers: Interpretable Neural Networks Through Interactive Proof Systems (ongoing)
Existing approaches for interpreting neural network classifiers by highlighting decision-relevant features are based solely on heuristics. We introduce a theory that bounds the quality of such features without assumptions on the classifier model, by relating classification to Interactive Proof Systems.
🧑‍🎓 Project Members
🪙 Funding
From April 2023 to March 2026, this project is funded by the Berlin Mathematics Research Center MATH+ (project ID EF1-24), which is in turn funded by the German Research Foundation (DFG) under Germany's Excellence Strategy (EXC-2046/1, project ID 390685689).
🔬 Project Description
Safe deployment of Neural Network (NN) based AI systems in high-stakes applications, e.g., medical diagnostics or autonomous vehicles, requires that their reasoning be subject to human scrutiny. The field of Explainable AI (XAI) has thus put forth a number of interpretability approaches, among them saliency maps, mechanistic interpretability, and self-interpreting networks. These methods, however, operate without theoretical guarantees, and it has been demonstrated that numerous XAI methods can be manipulated by a clever design of the NN.
We introduce a novel framework that guarantees feature-based interpretability by linking classification to Interactive Proof Systems: a prover selects a small set of features as a certificate, and a verifier must classify the input from those features alone, so that the completeness and soundness of the protocol translate into provable bounds on feature quality. We then extend this framework to real-world applications and challenges, both practically and theoretically. To tackle real-world problems, we intend to use successful models such as neural networks, or optimization algorithms such as Frank-Wolfe, as provers and verifiers. This is possible because our approach does not depend on the specific structure of the involved agents.
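The prover-verifier interplay can be illustrated with a minimal toy sketch, assuming a small Boolean dataset and an exhaustive-search prover; the names `arthur` and `best_mask` and the feature budget `K` are illustrative, not part of the published implementation. A cooperative prover (Merlin) tries to certify the true class with few revealed features, an adversarial prover tries to certify the wrong one, and the verifier decides from the revealed features alone:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: label is 1 iff feature 0 AND feature 1 are both set.
X = rng.integers(0, 2, size=(200, 6))
y = (X[:, 0] & X[:, 1]).astype(int)

K = 2  # number of features a prover may reveal

def arthur(x, mask, claimed_class):
    """Verifier: sees only the revealed features and accepts the
    claimed class iff they certify it under the known rule."""
    shown = {i: x[i] for i in mask}
    if claimed_class == 1:
        # Class 1 is certified by revealing features 0 and 1 both set.
        return shown.get(0) == 1 and shown.get(1) == 1
    # Class 0 is certified by revealing either of them unset.
    return any(shown.get(i) == 0 for i in (0, 1))

def best_mask(x, claimed_class):
    """Prover: exhaustively search for a K-subset of features
    that makes Arthur accept the claimed class."""
    for mask in itertools.combinations(range(x.shape[0]), K):
        if arthur(x, mask, claimed_class):
            return mask
    return None

# Completeness: the cooperative prover can certify the true class.
completeness = np.mean([best_mask(x, yi) is not None for x, yi in zip(X, y)])
# Soundness: the adversarial prover cannot certify the wrong class.
soundness = np.mean([best_mask(x, 1 - yi) is None for x, yi in zip(X, y)])
print(f"completeness={completeness:.2f}, soundness={soundness:.2f}")
# prints "completeness=1.00, soundness=1.00"
```

High completeness and soundness together certify that the revealed features genuinely determine the class; in the framework, these two protocol quantities are what yield the provable bounds on feature quality.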
💬 Talks and posters
Poster presentations
- Jul 2024: Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks by Berkant Turan
  Workshop on Theoretical Foundations of Foundation Models (TF2M) @ ICML 2024, Vienna
- May 2024: Interpretability Guarantees with Merlin-Arthur Classifiers by Berkant Turan
  27th AISTATS Conference, València
- Mar 2024: Group-wise Sparse and Explainable Adversarial Attacks by Shpresim Sadiku
  Deep Learning: Theory, Applications, and Implications, RIKEN Center for Advanced Intelligence Project, Tokyo
- Jul 2023: Extending Merlin-Arthur Classifiers for Improved Interpretability by Berkant Turan
  The 1st World Conference on eXplainable Artificial Intelligence
📝 Publications and preprints
Preprints
- Głuch, G., Turan, B., Nagarajan, S. G., and Pokutta, S. (2024). The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses.
[arXiv]
[BibTeX]
@misc{2024_GrzegorzTuranNagarajanPokutta_Watermarksadversarialdefenses, archiveprefix = {arXiv}, eprint = {2410.08864}, primaryclass = {cs.LG}, year = {2024}, author = {Głuch, Grzegorz and Turan, Berkant and Nagarajan, Sai Ganesh and Pokutta, Sebastian}, title = {The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses}, date = {2024-10-11} }
Conference proceedings
- Sadiku, S., Wagner, M., and Pokutta, S. (2025). Group-wise Sparse and Explainable Adversarial Attacks. Proceedings of the International Conference on Learning Representations.
[arXiv]
[BibTeX]
@inproceedings{2023_SadikuWagnerPokutta_Groupwisesparseattacks, year = {2025}, booktitle = {Proceedings of the International Conference on Learning Representations}, archiveprefix = {arXiv}, eprint = {2311.17434}, primaryclass = {cs.CV}, author = {Sadiku, Shpresim and Wagner, Moritz and Pokutta, Sebastian}, title = {Group-wise Sparse and Explainable Adversarial Attacks}, date = {2023-11-29} }
- Wäldchen, S., Sharma, K., Turan, B., Zimmer, M., and Pokutta, S. (2024). Interpretability Guarantees with Merlin-Arthur Classifiers. Proceedings of the International Conference on Artificial Intelligence and Statistics.
[arXiv]
[BibTeX]
@inproceedings{2022_WaeldchenEtAl_Interpretabilityguarantees, year = {2024}, booktitle = {Proceedings of the International Conference on Artificial Intelligence and Statistics}, archiveprefix = {arXiv}, eprint = {2206.00759}, primaryclass = {cs.LG}, author = {Wäldchen, Stephan and Sharma, Kartikey and Turan, Berkant and Zimmer, Max and Pokutta, Sebastian}, title = {Interpretability Guarantees with Merlin-Arthur Classifiers}, date = {2022-06-01} }