Abstract:
This paper focuses on building a model that generates image captions in our native language, Bangla, so that it can be integrated into any website or app and remain usable even by visually impaired people. In particular, we draw attention to the use of ResNet-152, a deep neural network 152 layers deep, as the encoder for the Bengali captioning task. Since no prior research has applied this approach to a Bangladeshi dataset, we construct a Bangla captions dataset. Our proposed model is a transfer learning-based approach that achieves state-of-the-art performance on this dataset. To extract accurate image features, we employ five CNN architectures: VGG-19, VGG-16, ResNet-50, ResNet-101, and ResNet-152, paired with a caption model built around a Bi-LSTM. Applying this hybrid model to our dataset yields strong results. Experimental results demonstrate that the models outperform previous research, reaching a BLEU-1 score of 88.18 when ResNet-152 is used as the encoder.
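As a rough illustration of the encoder-decoder pairing summarized above, the following Keras sketch combines a pretrained ResNet-152 feature extractor with a Bi-LSTM caption model. This is a minimal sketch, not the authors' exact implementation; the vocabulary size, embedding dimension, and maximum caption length are illustrative assumptions rather than values reported in the paper.

```python
# Minimal sketch of a ResNet-152 encoder + Bi-LSTM caption model (not the
# paper's exact code). Hyperparameters below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet152

VOCAB_SIZE = 8000   # assumed Bangla vocabulary size
MAX_LEN = 30        # assumed maximum caption length (tokens)
EMBED_DIM = 256

# Encoder: ResNet-152 pretrained on ImageNet, frozen for transfer learning;
# global average pooling yields one 2048-d feature vector per image.
encoder = ResNet152(weights="imagenet", include_top=False, pooling="avg")
encoder.trainable = False

# Caption model: image feature + caption prefix -> next-word distribution.
image_input = layers.Input(shape=(2048,), name="image_feature")
img_feat = layers.Dense(EMBED_DIM, activation="relu")(image_input)

caption_input = layers.Input(shape=(MAX_LEN,), name="caption_prefix")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_input)
# Bi-LSTM over the caption prefix; forward+backward outputs sum to EMBED_DIM.
seq = layers.Bidirectional(layers.LSTM(EMBED_DIM // 2))(emb)

merged = layers.add([img_feat, seq])
output = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

caption_model = Model([image_input, caption_input], output)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```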