Jindal, Kritika and Agarwal, Govind and Chowdhury, Abishi and Singh, Vishal Krishna and Ullah, Rahmat and Ur Rehman, Mujeeb and Pal, Amrit (2025) Deep Learning Based Captioning of Toys in a Smart Monitoring System. In: 2025 International Joint Conference on Neural Networks (IJCNN), 2025-06-30 - 2025-07-05, Rome.
Abstract
The domain of image captioning has attracted increased interest in recent times due to advancements in computer vision technology and the incorporation of deep learning models, specifically convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These developments enable the creation of more precise and contextually comprehensive descriptions of images. This research applies deep learning to the challenge of image captioning, particularly for toys. A new dataset is curated by sourcing copyright-free images from websites featuring diverse categories of toys. The images are enhanced through augmentation techniques to promote dataset generalization and robustness, culminating in a comprehensive collection of images spanning distinct classes, each annotated with manually crafted captions. Feature extraction was performed using pre-trained VGG16, DenseNet201, ResNet50, and ResNet101 models, which were fine-tuned to achieve optimal performance on the collected dataset. An LSTM was used as the language model. To extend the image captioning methodology to video captioning, YOLO was employed to detect objects within video frames. Additionally, to assist visually impaired children and create a more inclusive environment, the captions were converted to audio using Google Text-to-Speech. The approach was evaluated with the BLEU score, and ResNet101+LSTM yielded the highest BLEU-1 score of 0.975825, outperforming the other proposed approaches.
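The abstract describes a CNN-encoder/LSTM-decoder captioning pipeline with BLEU-1 evaluation and text-to-speech output. The sketch below illustrates one plausible realization of that pipeline in Keras; the layer sizes, vocabulary size, maximum caption length, and file names are illustrative assumptions and not the authors' exact configuration.

```python
# Hedged sketch of the described pipeline: pre-trained ResNet101 features,
# an LSTM decoder over partial caption sequences, BLEU-1 scoring, and
# gTTS audio output. Hyperparameters and names are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import ResNet101
from tensorflow.keras.applications.resnet import preprocess_input
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model
from nltk.translate.bleu_score import sentence_bleu
from gtts import gTTS

VOCAB_SIZE = 5000    # assumed vocabulary size
MAX_LEN = 30         # assumed maximum caption length
FEATURE_DIM = 2048   # ResNet101 global-average-pooled feature size

# 1) CNN encoder: frozen ResNet101 producing a 2048-d image feature vector.
encoder = ResNet101(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    arr = preprocess_input(np.expand_dims(tf.keras.utils.img_to_array(img), 0))
    return encoder.predict(arr, verbose=0)  # shape (1, 2048)

# 2) LSTM decoder: merge the image feature with the partial caption sequence
#    and predict the next word (the classic "merge" captioning architecture).
img_in = Input(shape=(FEATURE_DIM,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

seq_in = Input(shape=(MAX_LEN,))
seq_emb = Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
out = Dense(VOCAB_SIZE, activation="softmax")(merged)
caption_model = Model(inputs=[img_in, seq_in], outputs=out)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")

# 3) BLEU-1 evaluation of a generated caption against a reference caption.
reference = "a red toy car on a wooden table".split()
candidate = "a red toy car on the table".split()
bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0))

# 4) Text-to-speech output of the generated caption for visually impaired users.
gTTS(" ".join(candidate)).save("caption.mp3")
```

The sketch covers the still-image path only; the paper's video extension (YOLO detection on frames feeding the same captioning stage) is not reproduced here, as the specific detector version and frame-handling details are not stated in the abstract.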
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Uncontrolled Keywords: | Deep learning; Image captioning; Video captioning; CNN; LSTM |
| Subjects: | Z Bibliography. Library Science. Information Resources > ZR Rights Retention |
| Divisions: | Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
| Depositing User: | Unnamed user with email elements@essex.ac.uk |
| Date Deposited: | 02 Dec 2025 12:58 |
| Last Modified: | 02 Dec 2025 12:58 |
| URI: | http://repository.essex.ac.uk/id/eprint/42194 |
Available files
Filename: Deep_Learning_Based_Captioning_of_Toys_in_a_Smart_Monitoring_System.pdf
Licence: Creative Commons: Attribution 4.0