The rapid expansion of deep learning has given rise to a variety of proposals and concerns in the area of video description, particularly in the recent past. Video description can be framed as automatically localizing events in a video and generating textual descriptions of its complex and diverse visual content, bridging the two leading realms of computer vision and natural language processing. Many sequence-to-sequence approaches have been proposed that split the task into two stages: encoding, i.e., extracting and learning representations of the visual content, and decoding, i.e., transforming the learned representations into a sequence of words, one word at a time. Deep learning approaches have gained wide recognition owing to their computational power and strong performance. However, the success of these algorithms depends heavily on the nature, diversity, and amount of data they are trained, validated, and tested on. Techniques trained and evaluated on insufficient or inadequate data cannot deliver reliable conclusions, which in turn makes it difficult to judge the quality of the generated results. This survey focuses explicitly on the benchmark datasets and evaluation metrics developed and deployed for video description, along with their capabilities and limitations. Finally, we conclude with the essential enhancements still needed and promising research directions on the topic.
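To make the encoder-decoder split concrete, the following is a minimal, illustrative PyTorch sketch of a sequence-to-sequence video captioner: an encoder summarizes per-frame visual features into a hidden state, and a decoder emits one word at a time from that state. All module choices, names, and dimensions here are assumptions for illustration, not the method of any particular paper surveyed.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Hypothetical encoder-decoder sketch for video description."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Encoder: learns a representation of the frame-level visual features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder: transforms the learned representation into a word sequence.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len)
        _, state = self.encoder(frame_feats)                     # encode the video
        dec_out, _ = self.decoder(self.embed(captions), state)   # decode words
        return self.out(dec_out)                                 # per-step vocabulary logits

# Example: 4 videos with 20 frames each, paired with captions of length 12.
model = VideoCaptioner()
logits = model(torch.randn(4, 20, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 10000])
```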