Video description is one of the most challenging tasks in the combined domain of computer vision and natural language processing. Captions have been generated for various open- and constrained-domain videos in the recent past, but, to the best of our knowledge, descriptions for driving dashcam videos have never been explored. With the aim of exploring dashcam video description generation for autonomous driving, this study presents DeepRide: a large-scale dashcam driving video description dataset for location-aware dense video description generation. The human-described dataset comprises visual scenes and actions with diverse weather, people, objects, and geographical paradigms. It bridges the autonomous driving domain with video description by generating textual descriptions of the visual information seen by a dashcam. We describe 16,000 videos (40 seconds each) in English, employing 2,700 man-hours by two highly qualified teams with domain knowledge. The descriptions consist of eight to ten sentences, covering each dashcam video's global and event features in 60 to 90 words. The dataset contains more than 130K sentences, totaling approximately one million words. We evaluate the dataset by employing a location-aware vision-language recurrent transformer framework to demonstrate the efficacy and significance of visio-linguistic research for autonomous vehicles. We provide baseline results by employing three existing state-of-the-art recurrent models. The memory-augmented transformer performed best, owing to its highly summarized memory state of the visual information and sentence history while generating the trip description. Our proposed dataset opens a new dimension of diverse and exciting applications, such as self-driving vehicle reporting, driver and vehicle safety, inter-vehicle road intelligence sharing, and travel occurrence reports.
INDEX TERMS dashcam video description, video captioning, autonomous trip description

Comprehending the localized events of a video appropriately and then transforming the attained visual understand-