Expanding Merlin-Arthur Classifiers: Interpretable Neural Networks Through Interactive Proof Systems (ongoing)

Existing approaches for interpreting Neural Network classifiers that highlight features relevant for a decision are based solely on heuristics. We introduce a theory that allows us to bound the quality of the features without assumptions on the classifier model by relating classification to Interactive Proof Systems.

🧑‍🎓 Project Members

Kartikey Sharma
Principal Investigator
kartikey.sharma (at) zib.de
Sebastian Pokutta
Principal Investigator
pokutta (at) zib.de
Stephan Wäldchen
Principal Investigator
waeldchen (at) zib.de
Berkant Turan
turan (at) zib.de

🪙 Funding

This project is being funded by the Berlin Mathematics Research Center MATH+ (project ID EF1-24), itself funded by the German Research Foundation (DFG) under Germany's Excellence Strategy (EXC-2046/1, project ID 390685689) from April 2023 to March 2026.

🔬 Project Description

Safe deployment of Neural Network (NN) based AI systems in high-stakes applications, e.g., medical diagnostics or autonomous vehicles, requires that their reasoning be subject to human scrutiny. The field of Explainable AI (XAI) has thus put forth a number of interpretability approaches, among them saliency maps, mechanistic interpretability and self-interpreting networks. These methods operate without any theoretical guarantees, and it has been demonstrated that numerous XAI methods can be manipulated by a clever design of the NN.

Figure: An illustration of the prover-verifier setup with a cooperative prover (Merlin), an adversarial prover (Morgana) and the verifier (Arthur).

We introduce a novel framework that guarantees feature-based interpretability by linking classification to Interactive Proof Systems. We then extend our framework to real-world applications and challenges, both practically and theoretically. To tackle real-world problems, we intend to use successful models such as neural networks, or optimization algorithms such as Frank-Wolfe, as provers and verifiers. This is possible because our approach does not depend on the specific structure of the involved agents.
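The prover-verifier setup described above can be illustrated with a small toy sketch. It is not the project's implementation: the dataset, the feature names, the lookup-table Arthur, and the helper functions `merlin`, `morgana`, `completeness` and `soundness` are all hypothetical and chosen only for illustration. Here a data point is a set of discrete features, each prover sends exactly one feature of the point to Arthur, and Arthur either outputs a class or rejects (`None`). Completeness and soundness are computed as simple averages over the toy dataset, which is a simplification made for this sketch.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

Point = FrozenSet[str]                       # a data point is a set of features
Verifier = Callable[[str], Optional[int]]    # feature -> class, or None (reject)

# Hypothetical toy dataset: class 0 = cat, class 1 = bird.
DATA: List[Tuple[Point, int]] = [
    (frozenset({"fur", "whiskers", "grass"}), 0),
    (frozenset({"fur", "whiskers", "sofa"}), 0),
    (frozenset({"feathers", "beak", "grass"}), 1),
    (frozenset({"feathers", "beak", "sky"}), 1),
]

def make_arthur(table: dict) -> Verifier:
    """Arthur as a lookup table (assumption): unknown features are rejected."""
    return lambda feature: table.get(feature)

def merlin(x: Point, y: int, arthur: Verifier) -> str:
    """Cooperative prover: send a feature that makes Arthur answer correctly."""
    for f in sorted(x):
        if arthur(f) == y:
            return f
    return sorted(x)[0]  # no convincing feature exists; send anything

def morgana(x: Point, y: int, arthur: Verifier) -> str:
    """Adversarial prover: send a feature that makes Arthur answer wrongly."""
    for f in sorted(x):
        answer = arthur(f)
        if answer is not None and answer != y:
            return f
    return sorted(x)[0]  # no misleading feature exists; send anything

def completeness(data, arthur: Verifier) -> float:
    """Fraction of points that Arthur classifies correctly with Merlin's help."""
    return sum(arthur(merlin(x, y, arthur)) == y for x, y in data) / len(data)

def soundness(data, arthur: Verifier) -> float:
    """Fraction of points on which Morgana cannot force a wrong class.
    Rejecting (None) counts as safe."""
    return sum(arthur(morgana(x, y, arthur)) in (y, None) for x, y in data) / len(data)

# Arthur relying only on class-specific features.
careful = make_arthur({"whiskers": 0, "beak": 1})
# Arthur that additionally trusts a background feature shared by both classes.
careless = make_arthur({"whiskers": 0, "beak": 1, "grass": 0})

print(completeness(DATA, careful), soundness(DATA, careful))
print(completeness(DATA, careless), soundness(DATA, careless))
```

In this toy setting both Arthurs are fully complete, but the one that trusts the shared feature "grass" can be misled by Morgana on the bird that sits on grass. Because the agents are passed around as plain callables, any model could be substituted for the lookup table, which mirrors the remark above that the approach does not depend on the specific structure of the agents.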

💬 Talks and posters

Conference and workshop talks

Jul 2024
Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks by Berkant Turan
Workshop on Theoretical Foundations of Foundation Models (TF2M) @ ICML 2024, Vienna
Jul 2023
Extending Merlin-Arthur Classifiers for Improved Interpretability by Berkant Turan
The 1st World Conference on EXplainable Artificial Intelligence

📝 Publications and preprints

  1. Wäldchen, S., Sharma, K., Turan, B., Zimmer, M., and Pokutta, S. (2024). Interpretability Guarantees with Merlin-Arthur Classifiers. Proceedings of International Conference on Artificial Intelligence and Statistics. [arXiv]
    [BibTeX]
    @inproceedings{2022_WaeldchenEtAl_Interpretabilityguarantees,
      year = {2024},
      booktitle = {Proceedings of International Conference on Artificial Intelligence and Statistics},
      archiveprefix = {arXiv},
      eprint = {2206.00759},
      primaryclass = {cs.LG},
      author = {Wäldchen, Stephan and Sharma, Kartikey and Turan, Berkant and Zimmer, Max and Pokutta, Sebastian},
      title = {Interpretability Guarantees with Merlin-Arthur Classifiers}
    }
  2. Głuch, G., Turan, B., Nagarajan, S. G., and Pokutta, S. (2024). The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses. [arXiv]
    [BibTeX]
    @misc{2024_GrzegorzTuranNagarajanPokutta_Watermarksadversarialdefenses,
      archiveprefix = {arXiv},
      eprint = {2410.08864},
      primaryclass = {cs.LG},
      year = {2024},
      author = {Głuch, Grzegorz and Turan, Berkant and Nagarajan, Sai Ganesh and Pokutta, Sebastian},
      title = {The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses}
    }
  3. Sadiku, S., Wagner, M., and Pokutta, S. (2023). Group-wise Sparse and Explainable Adversarial Attacks. [arXiv]
    [BibTeX]
    @misc{2023_SadikuWagnerPokutta_Groupwisesparseattacks,
      archiveprefix = {arXiv},
      eprint = {2311.17434},
      primaryclass = {cs.CV},
      year = {2023},
      author = {Sadiku, Shpresim and Wagner, Moritz and Pokutta, Sebastian},
      title = {Group-wise Sparse and Explainable Adversarial Attacks}
    }