This study explores the effectiveness of the ConvNeXt model, an advanced computer vision architecture, for the task of image captioning. We integrated ConvNeXt with a Long Short-Term Memory (LSTM) network equipped with a visual attention module and assessed its performance across different scenarios. Experiments covered several ConvNeXt variants for feature extraction, different learning rates during training, and the impact of including or excluding teacher forcing. The MS COCO 2014 dataset was employed, with top-5 accuracy and BLEU-n metrics used to evaluate performance. Incorporating ConvNeXt into image captioning systems yields notable performance gains. In terms of BLEU-4 scores, ConvNeXt outperformed existing benchmarks by 43.04% for models using soft attention and by 39.04% for those with hard attention. Furthermore, ConvNeXt surpassed models based on vision transformers and data-efficient image transformers by 4.57% and 0.93%, respectively, in BLEU-4 scores. When compared with systems using encoders such as ResNet-101,