A massive volume of video, containing audio, visual, and textual data, is produced every day. This constant growth is driven by the ease of recording on portable devices such as mobile phones, tablets, and cameras. The major challenge is to understand the visual semantics and convert them into a condensed form, such as a caption or summary, which saves storage space, supports indexing and navigation, and helps users extract information in less time. We propose a joint end-to-end solution, Abstractive Summarization of Video Sequences, which uses a deep neural network to generate both a natural language description and an abstractive text summary of an input video. The resulting text-based description and summary enable users to distinguish relevant from irrelevant information according to their needs. Furthermore, our experiments show that the joint model outperforms baseline methods trained on the separate tasks, producing informative, concise, and readable multi-line video descriptions and summaries in a human evaluation.
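The abstract does not specify the architecture, but the joint description-then-summarization design it names can be illustrated with a minimal sketch. The sketch below assumes a common encoder-decoder layout: a recurrent encoder over per-frame CNN features, a caption decoder for the multi-line description, and a second decoder that produces the shorter abstractive summary from the caption decoder's state. All module names, dimensions, and the GRU-based choice are illustrative assumptions, not the authors' published model.

```python
import torch
import torch.nn as nn

class JointCaptionSummarizer(nn.Module):
    """Hypothetical sketch of a joint video-description + summarization model.

    Assumption: frame features -> video encoding -> caption decoder,
    whose final state also conditions an abstractive summary decoder,
    so both outputs are trained end to end.
    """
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000):
        super().__init__()
        self.frame_encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.caption_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.summary_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.caption_head = nn.Linear(hidden, vocab)
        self.summary_head = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, cap_len=20, sum_len=10):
        # Encode per-frame CNN features into a single video representation.
        _, h = self.frame_encoder(frame_feats)
        # Decode a multi-line description conditioned on the video encoding.
        cap_in = h.transpose(0, 1).repeat(1, cap_len, 1)
        cap_out, h2 = self.caption_decoder(cap_in, h)
        # Decode a shorter abstractive summary from the caption decoder state.
        sum_in = h2.transpose(0, 1).repeat(1, sum_len, 1)
        sum_out, _ = self.summary_decoder(sum_in, h2)
        return self.caption_head(cap_out), self.summary_head(sum_out)

model = JointCaptionSummarizer()
feats = torch.randn(2, 32, 2048)  # 2 clips, 32 frames of CNN features each
caption_logits, summary_logits = model(feats)
print(caption_logits.shape, summary_logits.shape)  # (2, 20, 10000) (2, 10, 10000)
```

Because the summary decoder is conditioned on the caption decoder's state rather than trained in isolation, a single loss over both outputs lets the description and summary tasks share representations, which is one plausible reading of why the joint model beats the separate-task baselines.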