Ali, Mohsin (2026) Explainable vision transformers with domain adaptation on limited datasets. Doctoral thesis, University of Essex. DOI https://doi.org/10.5526/ERR-00043108
Abstract
Artificial Intelligence (AI) is now widely used to analyse images in areas such as healthcare, automotive, retail, manufacturing, and security. However, many of today's deep learning models act as black boxes: they can be highly accurate, but it is not clear how they reach their decisions. This lack of transparency is a problem when decisions are safety-critical, for example in medical diagnosis. Explainable AI (XAI) helps address this by showing which parts of an image influenced a model's prediction, so that we can check whether it is attending to the right features. This thesis focuses on improving Vision Transformers (ViTs), a powerful class of model that splits images into pieces (patches) and reasons about them using attention mechanisms. ViTs work well with large datasets, but they struggle when data is limited, which is often the case in the healthcare domain. ViTs are also vulnerable to adversarial attacks, where tiny, imperceptible changes to an image can cause wrong predictions. To tackle these issues, four main contributions are made. First, a feature-map fusion method is introduced for Convolutional Neural Networks (CNNs), combining information from clean, noisy, and perturbed images to make models more robust. Second, two lightweight improvements to ViTs are proposed: the Summary Vision Transformer (S-ViT), which adds extra spatial information from a CNN, and the Multi-Gradient Image Transformer (MGiT), which stabilises training using an auxiliary transformer. Both methods improve performance on small and imbalanced datasets, such as skin lesion images (ISIC 2017) and COVID-19 chest X-rays. Third, XAI tools, including LIME, SHAP, Grad-CAM, and Attention Rollout, are used to confirm that these models focus on clinically meaningful regions. Finally, a new explanation method, FocusViT, is proposed to give sharper and more faithful explanations of ViT predictions.
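Of the XAI tools named above, Attention Rollout is the one specific to transformers: it propagates attention weights through the layers, treating residual connections as identity attention, to estimate how much each input patch contributes to the output. The thesis does not give its implementation here, so the following is only a minimal illustrative sketch of the standard algorithm (Abnar and Zuidema's formulation), not the thesis's own code; shapes and names are assumptions.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll attention through a transformer's layers.

    attentions: list of per-layer attention maps, each of shape
    (num_heads, num_tokens, num_tokens), with row-stochastic rows
    (i.e. softmax outputs).
    Returns a (num_tokens, num_tokens) map of input-token influence.
    """
    num_tokens = attentions[0].shape[-1]
    result = np.eye(num_tokens)
    for attn in attentions:
        # Average the attention maps over heads.
        attn_avg = attn.mean(axis=0)
        # Mix in the identity to account for residual connections,
        # then re-normalise so rows remain a probability distribution.
        attn_aug = 0.5 * attn_avg + 0.5 * np.eye(num_tokens)
        attn_aug = attn_aug / attn_aug.sum(axis=-1, keepdims=True)
        # Compose with the rollout accumulated from earlier layers.
        result = attn_aug @ result
    return result
```

Row `0` of the result (the CLS token's row, in the usual ViT layout) is then reshaped over the patch grid to give the saliency map overlaid on the image.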
| Item Type: | Thesis (Doctoral) |
|---|---|
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Divisions: | Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| Depositing User: | Mohsin Ali |
| Date Deposited: | 13 Apr 2026 10:28 |
| Last Modified: | 13 Apr 2026 10:28 |
| URI: | http://repository.essex.ac.uk/id/eprint/43108 |
Available files
Filename: New-Mohsin_PhD_Thesis.pdf