Image captioning is the task of automatically generating a natural-language sentence that describes the content of an image. In recent years, this problem has attracted considerable interest in the Artificial Intelligence (AI) community. Image captioning sits at the intersection of image processing and natural language processing, taking an image as input and producing a sentence as output. Image processing tasks, such as image segmentation, object tracking, object detection, and image recognition, are mostly performed using Convolutional Neural Networks (CNNs), while natural language processing tasks, such as semantic role labelling, neural machine translation, speech recognition, and question answering, owe many of their breakthroughs to Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. This paper proposes an efficient encoder-decoder image captioning model, the Two-Tier LSTM (TT-LSTM) model, in which the decoder is built from two stacked LSTM layers. The model is trained on the MSCOCO, Flickr30k, and Flickr8k datasets and evaluated with standard evaluation metrics: ROUGE-L, CIDEr, and four BLEU scores. Experiments on these benchmark datasets show that the proposed model generates meaningful natural-language sentences, improves sentence generation efficiency, and achieves better performance for image caption generation.
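To make the two-tier decoder concrete, the following PyTorch sketch stacks two LSTM layers on top of a linear projection of CNN image features. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the layer sizes, the class name TwoTierLSTMCaptioner, and the teacher-forcing interface are all illustrative choices, and the image is assumed to be pre-encoded into a pooled CNN feature vector.

```python
import torch
import torch.nn as nn

class TwoTierLSTMCaptioner(nn.Module):
    """Sketch of an encoder-decoder captioner whose decoder stacks two LSTM layers.

    Assumes the image has already been encoded by a CNN into a feature vector
    (e.g., pooled ResNet features). All hyperparameter values below are
    illustrative, not taken from the paper.
    """

    def __init__(self, feature_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project CNN image features into the word-embedding space so the
        # image acts as the first "token" fed to the decoder.
        self.feature_proj = nn.Linear(feature_dim, embed_dim)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Two-tier decoder: num_layers=2 stacks two LSTM layers.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # features: (batch, feature_dim); captions: (batch, seq_len) token ids
        img_token = self.feature_proj(features).unsqueeze(1)  # (batch, 1, embed_dim)
        words = self.embedding(captions)                      # (batch, seq_len, embed_dim)
        inputs = torch.cat([img_token, words], dim=1)         # image first, then words
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                # per-step vocabulary logits


# Minimal smoke test with random data.
model = TwoTierLSTMCaptioner()
feats = torch.randn(4, 2048)             # batch of 4 image feature vectors
caps = torch.randint(0, 10000, (4, 12))  # batch of 4 captions, 12 tokens each
logits = model(feats, caps)
print(logits.shape)                      # torch.Size([4, 13, 10000])
```

At inference time such a decoder would be unrolled step by step (e.g., with greedy or beam search), feeding each predicted word back in as the next input; the sketch above shows only the teacher-forced training pass.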