With the prevalence of video sharing, there are increasing demands for automatic video digestion such as highlight detection. Recently, platforms with crowdsourced time-sync video comments have emerged worldwide, providing a good opportunity for highlight detection. However, this task is non-trivial: (1) time-sync comments often lag behind their corresponding shots; (2) time-sync comments are semantically sparse and noisy; (3) determining which shots are highlights is highly subjective. The present paper aims to tackle these challenges by proposing a framework that (1) uses concept-mapped lexical chains for lag-calibration; (2) models video highlights based on comment intensity and the combination of emotion and concept concentration in each shot; (3) summarizes each detected highlight using an improved SumBasic with emotion and concept mapping. Experiments on large real-world datasets show that both our highlight detection method and our summarization method outperform other benchmarks by considerable margins.
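For context, the summarization component builds on SumBasic (Nenkova and Vanderwende, 2005), which greedily selects sentences by average unigram probability and squares the probabilities of already-covered words to discourage redundancy. Below is a minimal sketch of the vanilla algorithm only; the emotion and concept mapping extensions are specific to this paper and are not shown, and the whitespace tokenization and `max_words` budget are illustrative choices.

```python
from collections import Counter

def sumbasic(sentences, max_words=50):
    """Vanilla SumBasic: repeatedly pick the sentence whose words have
    the highest average unigram probability, then square the probability
    of each word just used (non-redundancy update)."""
    tokenized = [s.lower().split() for s in sentences]  # naive tokenizer
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    summary, length = [], 0
    remaining = list(range(len(sentences)))
    while remaining and length < max_words:
        # Score each candidate sentence by its mean word probability.
        best = max(
            remaining,
            key=lambda i: sum(prob[w] for w in tokenized[i])
                          / max(len(tokenized[i]), 1),
        )
        summary.append(sentences[best])
        length += len(tokenized[best])
        # Down-weight covered words: p(w) <- p(w)^2.
        for w in tokenized[best]:
            prob[w] = prob[w] ** 2
        remaining.remove(best)
    return summary
```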
Introduction

Every day, people watch billions of hours of video on YouTube, with half of the views on mobile devices (https://www.youtube.com/yt/press/statistics.html). With the prevalence of video sharing, there is increasing demand for fast video digestion. Imagine a scenario where a user wants to quickly grasp a long video without repeatedly dragging the progress bar to skip unappealing shots. With automatically generated highlights, users could digest the entire video in minutes before deciding whether to watch the full video later. Moreover, automatic video highlight detection and summarization could benefit video indexing, video search, and video recommendation. However, finding highlights in a video is not a trivial task. First, what is considered a "highlight" can be very subjective. Second, a highlight may not always be captured by analyzing low-level image, audio, and motion features. The lack of abstract semantic information has become a bottleneck for highlight detection in traditional video processing.

Recently, crowdsourced time-sync video comments, or "bullet-screen comments," have emerged: comments generated in real time fly over or beside the screen, synchronized with the video frame by frame. They have gained popularity worldwide, for example on niconico in Japan, Bilibili and Acfun in China, and YouTube Live and Twitch Live in the USA. The popularity of time-sync comments suggests new opportunities for video highlight detection based on natural language processing.

Nevertheless, it is still a challenge to detect and label highlights using time-sync comments. First, there is an almost inevitable lag between comments and the shots they refer to. As in Figure 1, ongoing discussion about one shot may extend into the next few shots. Highlight detection and labeling without lag-calibration may produce inaccurate results. Second, time-sync comments are semantically sparse, both in the number of comments per shot and in the number of tokens per comment. Traditional bag-of-words statistical models may...
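To make the comment-intensity signal mentioned above concrete, the sketch below bins time-sync comments into fixed-length shots, smooths the counts, and marks local maxima as candidate highlights. This is an illustrative baseline only, not the paper's full model, which additionally uses emotion and concept concentration and lag-calibration; the shot length and smoothing window are assumed parameters.

```python
import numpy as np

def comment_intensity_highlights(timestamps, video_len, shot_len=10.0, win=3):
    """timestamps: comment times in seconds; video_len: video length in
    seconds. Returns indices of shots that are candidate highlights."""
    n_shots = int(np.ceil(video_len / shot_len))
    counts = np.zeros(n_shots)
    for t in timestamps:
        counts[min(int(t // shot_len), n_shots - 1)] += 1
    # Moving-average smoothing to suppress single-shot noise.
    kernel = np.ones(win) / win
    smooth = np.convolve(counts, kernel, mode="same")
    # Keep shots that are local maxima above the mean intensity.
    return [i for i in range(1, n_shots - 1)
            if smooth[i] > smooth[i - 1]
            and smooth[i] >= smooth[i + 1]
            and smooth[i] > smooth.mean()]
```

Note that without lag-calibration, the peaks found this way will tend to fall one or more shots after the actual highlight, which is precisely the first challenge the paper addresses.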