“…Later deep learning model has been introduced and it is proven in various research papers [42,43,44] as better model because the network learns itself. Recently, deep learning methods are rocking in this research area and itslowly transforms the annotation into captioning.In deep learning, annotation models can be classified as annotating with tags [10,11,14,20,29,29,34], finding sequence of words [5,9,13,15,22,24,25,27],captioning [17,18,19,26,21]and classification [6,7,14,16].In the first model, assigning tags to images based on extracted features or objects in the image. Related tags are identified and make it as a sequence of tags whereas the third model combined related tags with Natural Language processing as meaningful sentence called as captioning.In deep learning annotation models, Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) perform well to encode features of image and decode the features into the natural language representation [2].Later Long-Short-Term-Memory (LSTM) has been introduced to conserve the dependency for future reference and good in natural language generating [13] which RNN can't.…”