<strong>Paper Title</strong><br>

Neural Narratives: Revolutionizing Image Captioning with Transformers<br>

<br>


<strong>Abstract</strong><br>

Image captioning, the task of generating human-like textual descriptions of images, has advanced considerably thanks to deep learning. This paper introduces "Neural Narratives," a framework for image captioning that uses a transformer-based architecture to improve caption quality and coherence. The model combines a pre-trained ResNet-50 convolutional neural network for feature extraction with a transformer decoder designed to learn visual-semantic associations and generate text sequentially. Our pipeline combines beam search decoding, gradient clipping, and a step-wise learning rate scheduler to support stable and effective learning. We evaluate the framework on COCO 2017, a widely used image captioning dataset, where we report comparable BLEU scores and high token-level accuracy, indicating that the model produces fluent and contextually appropriate captions. Beyond its implications for the state of the art in image captioning, the framework opens research directions in assistive technology, content accessibility, and multimedia analysis. Our work further shows the considerable potential of transformer models for integrating vision-based and language-based systems for entertainment applications, offering a flexible and scalable method for generating human-like narratives from images.

Keywords - ResNet-50, Image Captioning, Convolutional Neural Network, COCO 2017 Dataset, BLEU Score, CIDEr Metric, Adam Optimizer, Language Integration, Beam Search, Transformer Decoder