Yin, Xu and Jiang, Jiuchuan and Ge, Sheng and Gan, John Qiang and Wang, Haixian (2025) Aligning machines and minds: Neural encoding for high-level visual cortices based on image captioning task. Journal of Neural Engineering. DOI: https://doi.org/10.1088/1741-2552/ae1164
Abstract
Objective. Neural encoding of visual stimuli aims to predict brain responses in the visual cortex to different external inputs. Deep neural networks (DNNs) trained on relatively simple tasks such as image classification have been widely applied in neural encoding studies of early visual areas. However, due to the complex and abstract nature of semantic representations in high-level visual cortices, their encoding performance and interpretability remain limited. Approach. We propose a novel neural encoding model guided by the image captioning task (ICT). During image captioning, an attention module is employed to focus on key visual objects. In the neural encoding stage, a flexible receptive field (RF) module is designed to simulate voxel-level visual fields. To bridge the domain gap between these two processes, we introduce the Atten-RF module, which effectively aligns attention-guided visual representations with voxel-wise brain activity patterns. Main results. Experiments on the large-scale Natural Scenes Dataset (NSD) demonstrate that our method achieves superior average encoding performance across seven high-level visual cortices, with a mean squared error (MSE) of 0.765, Pearson correlation coefficient (PCC) of 0.443, and coefficient of determination (R²) of 0.245. Significance. By leveraging the guidance and alignment provided by a complex vision-language task, our model enhances the prediction of voxel activity in high-level visual cortex, offering a new perspective on the neural encoding problem. Furthermore, various visualization techniques provide deeper insights into the neural mechanisms underlying visual information processing.
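The record does not include the paper's code, so the following is only a rough illustration of what a voxel-level receptive field (RF) readout can look like. It implements a common pattern from the neural encoding literature (a learnable per-voxel 2D Gaussian pooling over CNN feature maps); the class name `GaussianReadout` and all design details are assumptions for illustration, not the authors' Atten-RF module.

```python
import torch
import torch.nn as nn

class GaussianReadout(nn.Module):
    """Hypothetical per-voxel RF readout: each voxel learns a spatial
    center (mu) and width (sigma) defining a Gaussian pooling window
    over a feature map, plus a linear weight across feature channels."""

    def __init__(self, n_channels: int, n_voxels: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_voxels, 2))      # RF centers in [-1, 1]
        self.log_sigma = nn.Parameter(torch.zeros(n_voxels))  # log RF widths
        self.weight = nn.Parameter(torch.randn(n_voxels, n_channels) * 0.01)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, H, W) -> predicted responses: (batch, n_voxels)
        b, c, h, w = feats.shape
        ys = torch.linspace(-1, 1, h, device=feats.device)
        xs = torch.linspace(-1, 1, w, device=feats.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)                  # (H, W, 2)
        sigma = self.log_sigma.exp()[:, None, None]           # (V, 1, 1)
        d2 = ((grid[None] - self.mu[:, None, None, :]) ** 2).sum(-1)  # (V, H, W)
        rf = torch.exp(-0.5 * d2 / sigma ** 2)
        rf = rf / rf.sum(dim=(-2, -1), keepdim=True)          # normalize each RF
        pooled = torch.einsum("bchw,vhw->bvc", feats, rf)     # spatial pooling per voxel
        return torch.einsum("bvc,vc->bv", pooled, self.weight)

# Usage sketch (shapes only):
# readout = GaussianReadout(n_channels=512, n_voxels=1000)
# responses = readout(torch.randn(8, 512, 14, 14))  # -> (8, 1000)
```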
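The three reported metrics (MSE, PCC, R²) are standard. A minimal sketch of how they can be computed voxel-wise, assuming measured and predicted fMRI responses are stacked as (stimuli × voxels) arrays; function and variable names here are hypothetical:

```python
import numpy as np

def voxelwise_metrics(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-8):
    """Per-voxel MSE, Pearson correlation, and coefficient of determination.

    y_true, y_pred: arrays of shape (n_stimuli, n_voxels) holding measured
    and predicted responses, respectively.
    """
    mse = np.mean((y_true - y_pred) ** 2, axis=0)

    # Pearson correlation per voxel (eps guards zero-variance voxels)
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    pcc = (yt * yp).sum(axis=0) / (
        np.sqrt((yt ** 2).sum(axis=0)) * np.sqrt((yp ** 2).sum(axis=0)) + eps
    )

    # Coefficient of determination per voxel
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    r2 = 1.0 - ss_res / (ss_tot + eps)
    return mse, pcc, r2
```

Averaging these per-voxel values across the seven regions of interest would yield summary figures comparable to those quoted in the abstract.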
Item Type: Article
Uncontrolled Keywords: attention; deep neural network; functional magnetic resonance imaging; image caption task; neural encoding
Subjects: Z Bibliography. Library Science. Information Resources > ZR Rights Retention
Divisions: Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of
SWORD Depositor: Unnamed user with email elements@essex.ac.uk
Depositing User: Unnamed user with email elements@essex.ac.uk
Date Deposited: 14 Oct 2025 08:59
Last Modified: 15 Oct 2025 14:29
URI: http://repository.essex.ac.uk/id/eprint/41728
Available files
Filename: Yin+et+al_2025_J._Neural_Eng._10.1088_1741-2552_ae1164.pdf
Licence: Creative Commons: Attribution 4.0