In image captioning, we generate visual descriptions from an image. Image captioning requires identifying the key entities, features, and associations in an image; the generated captions must also be syntactically and semantically correct. The process therefore draws on both computer vision and natural language processing. Over the past few decades, substantial effort has been made to generate captions for images. In this survey article, we present an extensive survey on image captioning for Indian languages. To summarize recent research in image captioning, we first briefly review the traditional template-based and retrieval-based approaches. We then concentrate on deep-learning approaches, which we classify into encoder-decoder architectures, attention-based approaches, and transformer architectures. The main focus of this survey is image captioning techniques for Indian languages such as Hindi, Bengali, and Assamese. After that, we analyze state-of-the-art approaches on the most widely used dataset, MS COCO, along with their strengths, limitations, and performance on standard metrics, i.e., BLEU, ROUGE, METEOR, CIDEr, and SPICE. Finally, we discuss open challenges and future directions in the field of image captioning.
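As a concrete illustration of one of the metrics mentioned above, the following is a minimal sketch of sentence-level BLEU in pure Python, using uniform n-gram weights and the standard brevity penalty. The function names are ours for illustration only; published results use reference implementations with smoothing and corpus-level aggregation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    Simplified sketch: single reference, no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        precisions.append(overlap / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = (1.0 if len(hypothesis) >= len(reference)
          else math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a dog runs in the park".split()
print(bleu(ref, "a dog runs in the park".split()))  # perfect match -> 1.0
print(bleu(ref, "a dog runs in the yard".split()))  # one word off -> < 1.0
```

ROUGE, METEOR, CIDEr, and SPICE follow the same pattern of comparing a candidate caption against references, but use recall, synonym matching, TF-IDF-weighted n-grams, and scene-graph tuples, respectively.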