Image captioning is the task of generating a sentence that summarizes the semantic content of an image. Describing the essential information of an image in natural language draws on both computer vision and natural language processing. Earlier research frequently addressed this problem with standard machine learning approaches that rely on hand-engineered features extracted from the input data. With the resurgence of deep learning, image captioning has achieved state-of-the-art results across numerous application domains, including visual aid, traffic assistance, medical support, and remote sensing. This paper thoroughly reviews application domain-based image captioning frameworks, covering both classical and modern deep learning-based architectures. We further propose a taxonomy that characterizes these frameworks by their distinct technological foundations and their performance across diverse datasets. We also describe the evaluation metrics used to assess image captioning models across several domains. Finally, we highlight open research problems concerning such models and outline directions for future work.