This research investigates the viability of Long Short-Term Memory (LSTM) networks, a subtype of Recurrent Neural Networks (RNNs), for image captioning. Using the MS COCO dataset, the study compares the performance of LSTM-based RNNs against a vanilla RNN, a Gated Recurrent Unit (GRU), attention mechanisms, and transformer-based models. Experimental results demonstrate that the LSTM-based RNN achieves competitive performance, with a BLEU-4 score of 0.72, a METEOR score of 0.68, and a CIDEr score of 2.1. The comparative analysis reveals its superiority over the vanilla RNN and the GRU, highlighting its ability to capture long-range dependencies within sequential image data. Moreover, the study examines the impact of attention mechanisms and transformer architectures, demonstrating their potential to improve context-aware caption generation. The transformer-based model outperforms all other models, achieving a BLEU-4 score of 0.78, a METEOR score of 0.72, and a CIDEr score of 2.5. These findings offer valuable insights into the evolving landscape of image captioning methods, establishing LSTM-based RNNs as robust and efficient approaches for capturing temporal sequences in visual content. In doing so, the study provides a framework for future development of hybrid models and architectures that push the boundaries of intelligent image perception and understanding.
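The abstract does not specify an implementation, so the following is a minimal illustrative sketch of the encoder-decoder setup it describes: pre-extracted CNN image features conditioning an LSTM decoder that generates caption tokens. All class names, dimensions, and hyperparameters here are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of a CNN-feature / LSTM-decoder captioner in PyTorch.
# All module names and hyperparameters are illustrative assumptions,
# not the implementation evaluated in the paper.
import torch
import torch.nn as nn

class LSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        # Project pre-extracted CNN image features into the word-embedding space.
        self.img_proj = nn.Linear(feat_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Single-layer LSTM decoder; the abstract's comparison would swap this
        # module for nn.RNN (vanilla RNN) or nn.GRU.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (batch, feat_dim); captions: (batch, seq_len) token ids.
        img = self.img_proj(img_feats).unsqueeze(1)   # image as the first step
        words = self.embed(captions[:, :-1])          # teacher forcing
        inputs = torch.cat([img, words], dim=1)       # (batch, seq_len, embed_dim)
        out, _ = self.lstm(inputs)
        return self.fc(out)                           # (batch, seq_len, vocab_size)

# Smoke test with random features and token ids.
model = LSTMCaptioner(vocab_size=10000)
feats = torch.randn(4, 2048)
caps = torch.randint(0, 10000, (4, 20))
logits = model(feats, caps)
print(logits.shape)  # torch.Size([4, 20, 10000])
```

In a setup like this, the decoder would be trained with cross-entropy against the reference captions, and the reported BLEU-4, METEOR, and CIDEr scores would be computed on captions generated at inference time (e.g., by greedy or beam-search decoding) against the MS COCO reference sets.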