Expanding Merlin-Arthur Classifiers: Interpretable Neural Networks Through Interactive Proof Systems (ongoing)

Existing approaches for interpreting neural network classifiers that highlight features relevant to a decision are based solely on heuristics. We introduce a theory that bounds the quality of such features without assumptions on the classifier model by relating classification to Interactive Proof Systems.

🧑‍🎓 Project Members

Kartikey Sharma
Principal Investigator
kartikey.sharma (at) zib.de
Sebastian Pokutta
Principal Investigator
pokutta (at) zib.de
Stephan Wäldchen
Principal Investigator
waeldchen (at) zib.de
Berkant Turan
turan (at) zib.de

🪙 Funding

This project is funded from April 2023 to March 2026 by the Berlin Mathematics Research Center MATH+ (project ID EF1-24), itself funded by the German Research Foundation (DFG) under Germany's Excellence Strategy (EXC-2046/1, project ID 390685689).

🔬 Project Description

Safe deployment of Neural Network (NN) based AI systems in high-stakes applications, e.g., medical diagnostics or autonomous vehicles, requires that their reasoning be subject to human scrutiny. The field of Explainable AI (XAI) has thus put forth a number of interpretability approaches, among them saliency maps, mechanistic interpretability, and self-interpreting networks. These methods operate without theoretical guarantees, and it has been demonstrated that numerous XAI methods can be manipulated by a clever design of the NN.

[Figure: An illustration of the prover-verifier setup with a cooperative prover (Merlin), an adversarial prover (Morgana), and the verifier (Arthur).]

We introduce a novel framework that guarantees feature-based interpretability by linking classification to Interactive Proof Systems. We then extend this framework to real-world applications and challenges, both practically and theoretically. To tackle real-world problems, we intend to use successful models such as neural networks, or optimization algorithms such as Frank-Wolfe, as provers and verifiers. This is possible because our approach does not depend on the specific structure of the involved agents.
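The prover-verifier game behind the framework can be illustrated with a toy sketch (all data, names, and rules here are hypothetical stand-ins; in the actual framework the agents are, e.g., neural networks): Merlin searches for a small feature certificate that convinces the verifier Arthur of the true class, while Morgana searches for one that would mislead him. Arthur is sound if Morgana never succeeds and complete if Merlin always does.

```python
import itertools

# Hypothetical toy dataset: each point is a set of named features plus a label.
# By construction, feature "a" indicates class 1 and feature "b" indicates class 0.
DATA = [({"a", "x"}, 1), ({"a", "y"}, 1), ({"b", "x"}, 0), ({"b", "y"}, 0)]

def arthur(features):
    """Verifier: classifies from a small feature subset, or abstains (None)."""
    if "a" in features:
        return 1
    if "b" in features:
        return 0
    return None  # refuses to commit on uninformative features

def merlin(x, label):
    """Cooperative prover: brute-forces a smallest certificate Arthur accepts."""
    for k in range(1, len(x) + 1):
        for subset in itertools.combinations(sorted(x), k):
            if arthur(set(subset)) == label:
                return set(subset)
    return None

def morgana(x, label):
    """Adversarial prover: tries to make Arthur output the *wrong* class."""
    for k in range(1, len(x) + 1):
        for subset in itertools.combinations(sorted(x), k):
            answer = arthur(set(subset))
            if answer is not None and answer != label:
                return set(subset)
    return None

# Completeness: Merlin finds a convincing certificate for every data point.
completeness = all(merlin(x, y) is not None for x, y in DATA)
# Soundness: Morgana never fools Arthur into the wrong class.
soundness = all(morgana(x, y) is None for x, y in DATA)
print(completeness, soundness)  # True True
```

If Arthur passes both checks, the certificates Merlin exhibits (here, `{"a"}` or `{"b"}`) are exactly the kind of provably relevant features the framework aims to guarantee.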

💬 Talks and posters

Conference and workshop talks

Jul 2024
Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks by Berkant Turan
Workshop on Theoretical Foundations of Foundation Models (TF2M) @ ICML 2024, Vienna
Jul 2023
Extending Merlin-Arthur Classifiers for Improved Interpretability by Berkant Turan
The 1st World Conference on EXplainable Artificial Intelligence

📝 Publications and preprints

  1. Wäldchen, S., Sharma, K., Turan, B., Zimmer, M., and Pokutta, S. (2024). Interpretability Guarantees with Merlin-Arthur Classifiers. Proceedings of International Conference on Artificial Intelligence and Statistics. [arXiv]
    [BibTeX]
    @inproceedings{2022_WaeldchenEtAl_Interpretabilityguarantees,
      year = {2024},
      booktitle = {Proceedings of International Conference on Artificial Intelligence and Statistics},
      archiveprefix = {arXiv},
      eprint = {2206.00759},
      primaryclass = {cs.LG},
      author = {Wäldchen, Stephan and Sharma, Kartikey and Turan, Berkant and Zimmer, Max and Pokutta, Sebastian},
      title = {Interpretability Guarantees with Merlin-Arthur Classifiers}
    }
  2. Głuch, G., Turan, B., Nagarajan, S. G., and Pokutta, S. (2024). The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses. [arXiv]
    [BibTeX]
    @misc{2024_GrzegorzTuranNagarajanPokutta_Watermarksadversarialdefenses,
      archiveprefix = {arXiv},
      eprint = {2410.08864},
      primaryclass = {cs.LG},
      year = {2024},
      author = {Głuch, Grzegorz and Turan, Berkant and Nagarajan, Sai Ganesh and Pokutta, Sebastian},
      title = {The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses}
    }
  3. Sadiku, S., Wagner, M., and Pokutta, S. (2023). Group-wise Sparse and Explainable Adversarial Attacks. [arXiv]
    [BibTeX]
    @misc{2023_SadikuWagnerPokutta_Groupwisesparseattacks,
      archiveprefix = {arXiv},
      eprint = {2311.17434},
      primaryclass = {cs.CV},
      year = {2023},
      author = {Sadiku, Shpresim and Wagner, Moritz and Pokutta, Sebastian},
      title = {Group-wise Sparse and Explainable Adversarial Attacks}
    }