Optimising Vision Transformer Performance on Limited Datasets: A Multi-Gradient Approach

Ali, Mohsin and Raza, Haider and Gan, John and Haris, Muhammad (2025) Optimising Vision Transformer Performance on Limited Datasets: A Multi-Gradient Approach. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2025-06-11 - 2025-06-15, Nashville, USA.

Abstract

Vision Transformers (ViTs) are well-known for capturing the global context of images using Multi-head Self-Attention (MHSA). However, compared to Convolutional Neural Networks (CNNs), ViTs typically exhibit a reduced inductive bias and require a larger volume of training image data to learn local feature representations. While various methods like the integration of CNN features or advanced pre-training strategies have been proposed to introduce this inductive bias, they often require significant architectural modifications or rely heavily on expansive pre-training datasets. This paper introduces a novel approach for training ViTs on limited datasets without altering the ViT architecture. We propose the Multi-Gradient Image Transformer (MGiT), which utilizes a parallel training method with a compact auxiliary ViT to adaptively optimize the weights of the target ViT. This approach yields significant performance improvements across diverse datasets and training scenarios. Our findings demonstrate that MGiT enhances ViT efficiency more effectively than traditional training methods. Furthermore, the application of Jensen-Shannon (JS) Divergence validates the convergence and alignment of feature understanding between the primary and auxiliary ViTs, thereby stabilizing the training process. The code is available at https://github.com/game-sys/Multi-Gradient-Image-Transformer-MGiT-
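The abstract names Jensen-Shannon (JS) Divergence as the measure used to check alignment between the primary and auxiliary ViTs, but does not say which quantities are compared. The following is a minimal sketch of the standard JS divergence between two discrete distributions, JS(P, Q) = 0.5 KL(P || M) + 0.5 KL(Q || M) with M = (P + Q) / 2. Applying it to softmax class probabilities is an assumption made for illustration. The names `softmax`, `kl_divergence`, `js_divergence` and the example logits are hypothetical and are not taken from the MGiT code.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    # The mixture m is positive wherever p or q is, so the KL terms stay finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical logits for one image from a primary and an auxiliary model
# (illustrative values only, not results from the paper).
primary_probs = softmax([2.0, 0.5, -1.0])
auxiliary_probs = softmax([1.8, 0.7, -0.9])
print(js_divergence(primary_probs, auxiliary_probs))
```

A value near zero indicates that the two distributions nearly coincide; the maximum, ln(2) in nats, is reached when they share no support.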

Item Metadata

Item Type:	Conference or Workshop Item (Paper)
Uncontrolled Keywords:	Training, Computer vision, Conferences, Transfer learning, Refining, Computer architecture, Transformers, Pattern recognition, Convolutional neural networks, Convergence
Subjects:	Z Bibliography. Library Science. Information Resources > ZR Rights Retention
Divisions:	Faculty of Science and Health > Computer Science and Electronic Engineering, School of
SWORD Depositor:	Unnamed user with email elements@essex.ac.uk
Depositing User:	Unnamed user with email elements@essex.ac.uk
Date Deposited:	03 Jun 2026 15:23
Last Modified:	03 Jun 2026 15:23
URI:	http://repository.essex.ac.uk/id/eprint/40678

Available files

Accepted Version

Filename: MGiT_cameraReady.pdf

Licence: Creative Commons: Attribution 4.0

