Ali, Mohsin and Raza, Haider and Gan, John and Khan, Muhammad Haris (2026) FocusViT: Faithful Explanations for Vision Transformers via Gradient-Guided Layer-Skipping. In: International Conference on Artificial Intelligence and Statistics, 2026-05-02 - 2026-05-05, Tangier, Morocco. (In Press)
Abstract
Vision Transformers (ViTs) have emerged as powerful alternatives to CNNs for various vision tasks, yet their token-based, attention-driven architecture makes interpreting their predictions challenging. Existing explainability methods, such as Grad-CAM and Attention Rollout, either fail to capture hierarchical semantic information or assume attention directly reflects importance, often leading to misleading explanations. We propose FocusViT, a novel explainability framework that integrates gradient-weighted attention attribution with dynamic, faithfulness-driven layer aggregation. By fusing attention maps with class-specific gradients and introducing per-head dynamic weighting, FocusViT highlights not only where the model attends but also how sensitive the prediction is to those attentions. Furthermore, our adaptive layer-skipping strategy ensures that only semantically meaningful layers contribute to the final explanation, enhancing both faithfulness and clarity. Extensive quantitative and qualitative evaluations on diverse benchmarks demonstrate that FocusViT outperforms existing methods in faithfulness and sparsity, achieves competitive robustness and class sensitivity, and provides sharper, more reliable visual explanations for ViTs. The official implementation is publicly available at: https://github.com/game-sys/focusvit-aistats2026.git
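The abstract's core idea — fusing per-layer attention maps with class-specific gradients and skipping uninformative layers before aggregating — can be sketched roughly as follows. This is a minimal NumPy illustration assuming the paper's general recipe (gradient-weighted attention rollout with a layer mask); the function name `gradient_weighted_rollout`, the `keep_mask` argument, and the specific fusion rule (positive part of attention times gradient, head-averaged) are assumptions, not the authors' exact method.

```python
import numpy as np

def gradient_weighted_rollout(attentions, gradients, keep_mask=None):
    """Hypothetical sketch of gradient-weighted attention rollout.

    attentions, gradients: lists (one entry per layer) of arrays shaped
        (heads, tokens, tokens); gradients are d(class logit)/d(attention).
    keep_mask: optional per-layer booleans standing in for the paper's
        faithfulness-driven layer skipping (details assumed here).
    Returns: relevance of each patch token to the CLS token.
    """
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer, (A, G) in enumerate(zip(attentions, gradients)):
        if keep_mask is not None and not keep_mask[layer]:
            continue  # skip layers judged uninformative
        # Fuse attention with class-specific gradients; keep positive evidence
        # and average over heads.
        cam = np.maximum(A * G, 0.0).mean(axis=0)      # (tokens, tokens)
        cam = cam + np.eye(n_tokens)                   # account for residual path
        cam = cam / cam.sum(axis=-1, keepdims=True)    # row-normalize
        rollout = cam @ rollout                        # propagate across layers
    return rollout[0, 1:]  # CLS row, relevance over patch tokens
```

In practice the attention maps and gradients would come from forward/backward hooks on a ViT (e.g. in PyTorch), and the layer mask would be chosen by a faithfulness criterion rather than supplied by hand.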
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
| Depositing User: | Unnamed user with email elements@essex.ac.uk |
| Date Deposited: | 26 Jan 2026 12:57 |
| Last Modified: | 26 Jan 2026 12:57 |
| URI: | http://repository.essex.ac.uk/id/eprint/42653 |
Available files
Filename: AISTATS2026_AcceptedVersion.pdf