Expanding Merlin-Arthur Classifiers: Interpretable Neural Networks Through Interactive Proof Systems

Status: ongoing

Existing approaches for interpreting Neural Network classifiers that highlight features relevant for a decision are based solely on heuristics. We introduce a theory that allows us to bound the quality of the features without assumptions on the classifier model by relating classification to Interactive Proof Systems.

MATH+ EF1-24

Apr 2023 to Mar 2026

🧑‍🎓 Project Members

Kartikey Sharma
Principal Investigator
kartikey.sharma (at) zib.de

Sebastian Pokutta
Principal Investigator
pokutta (at) zib.de

Stephan Wäldchen
Principal Investigator
waeldchen (at) zib.de

Berkant Turan
turan (at) zib.de

🪙 Funding

This project is being funded by the Berlin Mathematics Research Center MATH+ (project ID EF1-24), itself funded by the German Research Foundation (DFG) under Germany's Excellence Strategy (EXC-2046/1, project ID 390685689) from April 2023 to March 2026.

🔬 Project Description

Safe deployment of Neural Network (NN) based AI systems in high-stakes applications, e.g., medical diagnostics or autonomous vehicles, requires that their reasoning be subject to human scrutiny. The field of Explainable AI (XAI) has thus put forth a number of interpretability approaches, among them saliency maps, mechanistic interpretability and self-interpreting networks. These methods operate without any theoretical guarantees, and it has been demonstrated that numerous XAI methods can be manipulated by a clever design of the NN.

Figure: An illustration of the prover-verifier setup with a cooperative prover (Merlin), an adversarial prover (Morgana) and the verifier (Arthur).

We introduce a novel framework that guarantees feature-based interpretability by linking classification to Interactive Proof Systems. We then extend our framework to real-world applications and challenges, both practically and theoretically. To tackle real-world problems, we intend to use successful models such as neural networks, or optimization algorithms such as Frank-Wolfe, as provers and verifiers. This is possible because our approach does not depend on the specific structure of the involved agents.
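To make the prover-verifier setup more concrete, the following is a minimal toy sketch in Python. It is not the project's implementation. The data set, the hand-written rule used by Arthur, and all function names (`arthur`, `merlin`, `morgana`, `completeness`, `soundness`) are hypothetical and chosen only for illustration. The sketch assumes that a "feature" is a single revealed entry of a binary input, that Arthur may abstain, and that soundness is measured as the fraction of inputs on which the adversarial prover cannot push Arthur to the wrong class.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, ...]      # a toy binary input
Feature = Tuple[int, int]     # (position, value): one revealed entry of the input
Dataset = List[Tuple[Sample, int]]


def features_of(x: Sample) -> List[Feature]:
    """All features a prover may reveal for input x (assumption: single entries)."""
    return list(enumerate(x))


def arthur(feature: Feature) -> Optional[int]:
    """Toy verifier: sees only the revealed feature, returns a class or None (abstain)."""
    position, value = feature
    if value != 1:
        return None                      # a revealed 0 carries no evidence here
    return 1 if position < 2 else 0      # a 1 on the left means class 1, on the right class 0


def merlin(x: Sample, label: int) -> Feature:
    """Cooperative prover: reveal a feature that leads Arthur to the correct class."""
    for f in features_of(x):
        if arthur(f) == label:
            return f
    return features_of(x)[0]             # nothing convincing found, reveal anything


def morgana(x: Sample, label: int) -> Feature:
    """Adversarial prover: reveal a feature that leads Arthur to the wrong class."""
    for f in features_of(x):
        verdict = arthur(f)
        if verdict is not None and verdict != label:
            return f
    return features_of(x)[0]             # no misleading feature exists in x


def completeness(data: Dataset) -> float:
    """Fraction of inputs that Arthur classifies correctly with Merlin's help."""
    return sum(arthur(merlin(x, y)) == y for x, y in data) / len(data)


def soundness(data: Dataset) -> float:
    """Fraction of inputs on which Morgana fails to make Arthur output the wrong class."""
    return sum(arthur(morgana(x, y)) in (None, y) for x, y in data) / len(data)


# Hypothetical data: the last sample contains evidence for both classes.
data: Dataset = [
    ((1, 0, 0, 0), 1),
    ((0, 1, 0, 0), 1),
    ((0, 0, 1, 0), 0),
    ((0, 0, 0, 1), 0),
    ((1, 0, 0, 1), 1),
]

print(completeness(data))  # 1.0
print(soundness(data))     # 0.8
```

In this toy data, Merlin can always convince Arthur of the correct class, while Morgana succeeds only on the last sample, which contains a feature that is also typical of the other class. The point of the sketch is that both quantities are computed purely from the interaction between the agents, without inspecting how Arthur works internally.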

💬 Talks and posters

Poster presentations

May 2025: The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses by Berkant Turan
7th DOxML Conference, Kyoto
Apr 2025: Group-wise Sparse and Explainable Adversarial Attacks by Shpresim Sadiku
13th International Conference on Learning Representations (ICLR), Singapore
Apr 2025: The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses by Berkant Turan
Workshop on GenAI Watermarking @ ICLR 2025, Singapore
Jul 2024: Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks by Berkant Turan
Workshop on Theoretical Foundations of Foundation Models (TF2M) @ ICML 2024, Vienna
May 2024: Interpretability Guarantees with Merlin-Arthur Classifiers by Berkant Turan
27th AISTATS Conference, València
Mar 2024: Group-wise Sparse and Explainable Adversarial Attacks by Shpresim Sadiku
Deep Learning: Theory, Applications, and Implications, RIKEN Center for Advanced Intelligence Project, Tokyo
Jul 2023: Extending Merlin-Arthur Classifiers for Improved Interpretability by Berkant Turan
The 1st World Conference on eXplainable Artificial Intelligence


📝 Publications and preprints

Conference proceedings

Sadiku, S., Wagner, M., and Pokutta, S. (2025). Group-wise Sparse and Explainable Adversarial Attacks. Proceedings of the International Conference on Learning Representations. [arXiv]

[BibTeX]

@inproceedings{2023_SadikuWagnerPokutta_Groupwisesparseattacks,
  year = {2025},
  booktitle = {Proceedings of the International Conference on Learning Representations},
  archiveprefix = {arXiv},
  eprint = {2311.17434},
  primaryclass = {cs.CV},
  author = {Sadiku, Shpresim and Wagner, Moritz and Pokutta, Sebastian},
  title = {Group-wise Sparse and Explainable Adversarial Attacks}
}

Głuch, G., Turan, B., Nagarajan, S. G., and Pokutta, S. (2025). The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses. Proceedings of the ICLR Workshop on GenAI Watermarking. [arXiv]

[BibTeX]

@inproceedings{2024_GrzegorzTuranNagarajanPokutta_Watermarksadversarialdefenses,
  year = {2025},
  booktitle = {Proceedings of the ICLR Workshop on GenAI Watermarking},
  archiveprefix = {arXiv},
  eprint = {2410.08864},
  primaryclass = {cs.LG},
  author = {Głuch, Grzegorz and Turan, Berkant and Nagarajan, Sai Ganesh and Pokutta, Sebastian},
  title = {The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses}
}

Wäldchen, S., Sharma, K., Turan, B., Zimmer, M., and Pokutta, S. (2024). Interpretability Guarantees with Merlin-Arthur Classifiers. Proceedings of the International Conference on Artificial Intelligence and Statistics. [arXiv]

[BibTeX]

@inproceedings{2022_WaeldchenEtAl_Interpretabilityguarantees,
  year = {2024},
  booktitle = {Proceedings of the International Conference on Artificial Intelligence and Statistics},
  archiveprefix = {arXiv},
  eprint = {2206.00759},
  primaryclass = {cs.LG},
  author = {Wäldchen, Stephan and Sharma, Kartikey and Turan, Berkant and Zimmer, Max and Pokutta, Sebastian},
  title = {Interpretability Guarantees with Merlin-Arthur Classifiers}
}