Ali, Mohsin and Raza, Haider and Gan, John and Haris, Muhammad (2025) Optimising Vision Transformer Performance on Limited Datasets: A Multi-Gradient Approach. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2025-06-11 - 2025-06-15, Nashville, USA.
Abstract
Vision Transformers (ViTs) are well known for capturing the global context of images using Multi-head Self-Attention (MHSA). However, compared to Convolutional Neural Networks (CNNs), ViTs typically exhibit a weaker inductive bias and require a larger volume of training image data to learn local feature representations. While various methods, such as the integration of CNN features or advanced pre-training strategies, have been proposed to introduce this inductive bias, they often require significant architectural modifications or rely heavily on expansive pre-training datasets. This paper introduces a novel approach for training ViTs on limited datasets without altering the ViT architecture. We propose the Multi-Gradient Image Transformer (MGiT), which utilizes a parallel training method with a compact auxiliary ViT to adaptively optimize the weights of the target ViT. This approach yields significant performance improvements across diverse datasets and training scenarios. Our findings demonstrate that MGiT improves ViT efficiency more than traditional training methods do. Furthermore, the application of Jensen-Shannon (JS) Divergence validates the convergence and alignment of feature understanding between the primary and auxiliary ViTs, thereby stabilizing the training process. The code is available at https://github.com/game-sys/Multi-Gradient-Image-Transformer-MGiT-
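For readers unfamiliar with the Jensen-Shannon divergence mentioned in the abstract, the following is a minimal sketch of how it can be computed between two probability distributions. It is not taken from the paper or its repository: the function names, the example logits, and the assumption that the divergence is measured between the class-probability outputs of the primary and auxiliary ViTs are all illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical logits standing in for the outputs of the two models (assumed values).
primary_logits = [2.0, 0.5, -1.0]
auxiliary_logits = [1.8, 0.7, -0.9]

jsd = js_divergence(softmax(primary_logits), softmax(auxiliary_logits))
print(f"JS divergence: {jsd:.6f} nats")
```

The JS divergence is symmetric and, with natural logarithms, bounded between 0 and ln 2, so a value approaching 0 during training would indicate that the two distributions being compared are aligning.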
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Uncontrolled Keywords: | Training, Computer vision, Conferences, Transfer learning, Refining, Computer architecture, Transformers, Pattern recognition, Convolutional neural networks, Convergence |
| Subjects: | Z Bibliography. Library Science. Information Resources > ZR Rights Retention |
| Divisions: | Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
| Depositing User: | Unnamed user with email elements@essex.ac.uk |
| Date Deposited: | 03 Jun 2026 15:23 |
| Last Modified: | 03 Jun 2026 15:23 |
| URI: | http://repository.essex.ac.uk/id/eprint/40678 |
Available files
Filename: MGiT_cameraReady.pdf
Licence: Creative Commons: Attribution 4.0