Yin, Xu and Gan, John Q and Wang, Haixian (2026) Cogformer: A unified multi-scale brain representation for visual decoding and reconstruction from fMRI. IEEE Transactions on Medical Imaging, 45 (6). pp. 3021-3038. DOI https://doi.org/10.1109/tmi.2026.3667706
Abstract
With the rapid development of deep generative models (DGMs), the performance of decoding language and reconstructing images from functional magnetic resonance imaging (fMRI) has improved. Nevertheless, accurately representing brain activity remains highly challenging, primarily because paired samples are limited and fMRI has a low signal-to-noise ratio. To tackle these challenges, we introduce Cogformer, a unified multi-scale brain representation method. It is the first to learn brain representations from multi-scale fMRI activity via self-attention and to integrate a synchronized decoding and dynamic decoupling strategy for structural and semantic features through cross-attention. We conduct a systematic evaluation of Cogformer on the large-scale Natural Scenes Dataset (NSD) across a broad range of visual decoding tasks, including category classification, multi-label classification, image retrieval, image captioning, and image reconstruction. To the best of our knowledge, this represents the most extensive task coverage reported in related research. Cogformer achieves superior performance compared to a range of transformer-based baselines in category classification, multi-label classification, and image retrieval. Moreover, in the more challenging tasks of image captioning and image reconstruction, Cogformer leverages a prior diffusion module to enhance alignment with image semantics, which further improves semantic consistency in caption generation and visual fidelity in image reconstruction. Across multiple evaluation metrics, Cogformer demonstrates competitive performance against existing state-of-the-art (SOTA) methods, highlighting its strong decoding capabilities and generalization potential.
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Functional Magnetic Resonance Imaging; Brain decoding; Transformer; Diffusion model |
| Subjects: | Z Bibliography. Library Science. Information Resources > ZR Rights Retention |
| Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
| Depositing User: | Unnamed user with email elements@essex.ac.uk |
| Date Deposited: | 02 Mar 2026 13:55 |
| Last Modified: | 07 Jun 2026 09:46 |
| URI: | http://repository.essex.ac.uk/id/eprint/42875 |
Available files
Filename: FINAL VERSION.pdf