Caption generation is the task of producing textual descriptions from images. Objects are first detected in the image and classified into predefined classes, and the detected objects are then organized into natural-language sentences. This is an iterative process that combines image recognition and machine vision: the model must infer relations among objects, persons, and animals, and express those relations as a textual description. This paper studies deep learning techniques for discovering, identifying, and producing good captions for a source image. Image captioning, the process of generating sentence-form explanations for a source image, draws on both machine vision and natural language generation, and recent models have applied deep learning to both components to achieve substantial gains in performance.

A more recent trend is the use of attention-based architectures for captioning. Most existing decoders apply visual attention to every generated word, whether the word is visual (e.g., "dog") or non-visual (e.g., "the", "of"). Non-visual words can be predicted reliably by the language model alone, without consulting visual signals, and forcing visual attention on them can degrade captioning performance. Taking these issues into consideration, a hierarchical LSTM (Long Short-Term Memory) with an adaptive attention approach is presented for generating captions for images and videos.
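To make the adaptive-attention idea concrete, the sketch below implements one widely used visual-sentinel formulation: an LSTM cell distills a sentinel vector s_t from its memory cell, and the attention module softmaxes over the k spatial image features plus the sentinel, so the attention weight placed on the sentinel acts as a gate that lets the decoder fall back on the language model for non-visual words. This is a minimal illustrative sketch, not the exact architecture proposed here; the module names, the shared dimension d, and the toy tensors are assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentinelLSTM(nn.Module):
    """LSTM cell that also emits a visual sentinel s_t = g_t * tanh(c_t)."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTMCell(input_dim, hidden_dim)
        self.w_x = nn.Linear(input_dim, hidden_dim)
        self.w_h = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x_t, state):
        h_prev, _ = state
        h_t, c_t = self.lstm(x_t, state)
        g_t = torch.sigmoid(self.w_x(x_t) + self.w_h(h_prev))  # sentinel gate
        s_t = g_t * torch.tanh(c_t)                            # visual sentinel
        return h_t, c_t, s_t

class AdaptiveAttention(nn.Module):
    """Attend over k spatial features plus the sentinel; the sentinel's
    attention weight beta gates between visual context and the language
    model. Assumes image features and hidden state share dimension d."""

    def __init__(self, d: int, att_dim: int):
        super().__init__()
        self.w_v = nn.Linear(d, att_dim)   # project image features
        self.w_g = nn.Linear(d, att_dim)   # project hidden state
        self.w_s = nn.Linear(d, att_dim)   # project sentinel
        self.w_a = nn.Linear(att_dim, 1)   # scalar score per candidate

    def forward(self, V, h_t, s_t):
        # V: (b, k, d) spatial features; h_t, s_t: (b, d)
        content = self.w_g(h_t).unsqueeze(1)                         # (b, 1, a)
        z = self.w_a(torch.tanh(self.w_v(V) + content)).squeeze(-1)  # (b, k)
        z_s = self.w_a(torch.tanh(self.w_s(s_t) + self.w_g(h_t)))    # (b, 1)
        alpha = F.softmax(torch.cat([z, z_s], dim=1), dim=1)         # (b, k+1)
        c_t = (alpha[:, :-1].unsqueeze(-1) * V).sum(dim=1)           # visual context
        beta = alpha[:, -1:]                                         # sentinel weight
        return beta * s_t + (1.0 - beta) * c_t, beta                 # adaptive context

# Toy forward pass with random tensors (shapes only; no trained weights).
b, k, d = 2, 49, 512                     # batch, spatial regions, shared dim
V = torch.randn(b, k, d)                 # e.g., a flattened 7x7 CNN feature map
x_t = torch.randn(b, d)                  # current word embedding
state = (torch.zeros(b, d), torch.zeros(b, d))

cell, attn = SentinelLSTM(d, d), AdaptiveAttention(d, att_dim=256)
h_t, c_t, s_t = cell(x_t, state)
context, beta = attn(V, h_t, s_t)        # beta near 1 => rely on language model
print(context.shape, beta.shape)         # torch.Size([2, 512]) torch.Size([2, 1])
```

At each decoding step the next caption word would then be predicted from the adaptive context together with h_t; when beta is high the context is dominated by the sentinel, which is precisely the fallback to the language model for non-visual words described above.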