Image captioning, the challenging interdisciplinary task of generating informative and detailed descriptions of images, lies at the intersection of computer vision and natural language processing. In this paper, we provide a comprehensive examination of image captioning models, evaluation metrics, and datasets. We present a thorough review of popular image captioning models, from conventional methods to the most recent developments utilizing deep learning and attention mechanisms. We examine the architecture, underlying assumptions, and capabilities of these models, emphasizing how they contribute to producing coherent and contextually appropriate captions. Furthermore, we analyze in detail well-known evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr, clarifying their role in assessing the quality of generated captions against ground-truth references. Additionally, we discuss the critical role that datasets play in image captioning research, with particular attention to prominent datasets such as COCO, Flickr30k, and Conceptual Captions. We investigate these datasets' diversity, scale, and annotations, emphasizing their impact on model training and evaluation. Our objective is to provide researchers, practitioners, and newcomers with a valuable resource for understanding the current state of image captioning, thereby facilitating the development of innovative models and improved evaluation techniques.
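
To make the evaluation setting concrete, a metric such as BLEU scores a generated caption by its n-gram overlap with one or more human reference captions. The snippet below is a minimal sketch using NLTK's sentence-level BLEU; the example captions are hypothetical placeholders, not drawn from the surveyed datasets.

```python
# Minimal sketch: scoring one generated caption against reference captions
# with sentence-level BLEU from NLTK. Captions here are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across a grassy field".split(),
    "a brown dog is running on the grass".split(),
]
candidate = "a dog is running on the grass".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```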