2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2019.00056

Online Video Summarization: Predicting Future to Better Summarize Present

Cited by 26 publications (10 citation statements)
References 11 publications
“…Aiming to better learn how to estimate the importance of video frames/fragments, some techniques pay attention to both the spatial and temporal structure of the video. [33] presents an Encoder-Decoder architecture with convolutional LSTMs that models the spatiotemporal relationship among parts of the video. [34] uses 3D-CNNs and convolutional LSTMs to model the spatiotemporal structure of the video and select the video key-frames, while [35] extracts spatial and temporal information by processing the raw frames and their optical flow maps with CNNs.…”
Section: A. Supervised Video Summarization (mentioning)
confidence: 99%
“…Deep convolutional networks are typically used to extract features describing individual video frames, and video shots can be represented using aggregations of frame-level representations, such as the average feature vector [11] or a representative vector [12]. Other methods use deep networks to directly predict the relevance of video frames [13], or deep recurrent networks to predict the relevance of entire video portions [11], [14], [15]. Some works create video skims by treating key-shot selection as an object-detection problem, using temporal region proposals to identify the potential locations of key-shots [10].…”
Section: A. Video Summarization (mentioning)
confidence: 99%
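The frame-feature aggregation mentioned in the excerpt above can be sketched as follows. This is a minimal illustration, not the cited papers' exact pipelines: the feature dimension and the random "frame features" are assumptions standing in for a real deep-network feature extractor.

```python
import numpy as np

def shot_representation(frame_features: np.ndarray) -> np.ndarray:
    """Aggregate per-frame feature vectors (T x D) into a single
    shot-level vector (D,) by averaging, as in the average-feature
    aggregation strategy described above."""
    return frame_features.mean(axis=0)

# Toy example: a 4-frame shot with 8-dimensional frame features
# (both sizes are illustrative assumptions).
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))
shot_vec = shot_representation(frames)
print(shot_vec.shape)  # (8,)
```

In practice the per-frame vectors would come from a pretrained CNN, and alternatives to the mean (e.g., a representative vector) are also used, as the excerpt notes.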
“…Once again, the predicted importance scores are compared with the ground-truth data, and the outcome of this comparison guides the training of the Summarizer. From this perspective, [37] presents an encoder-decoder architecture with convolutional LSTMs that models the spatiotemporal relationship among parts of the video. In addition to the estimates of the frames' importance, the algorithm enhances the visual diversity of the summary via next-frame prediction and shot detection mechanisms, based on the intuition that the first frames of a shot generally have a high likelihood of being part of the summary.…”
Section: A. Supervised Video Summarization (mentioning)
confidence: 99%
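The supervised setup described above (predicted per-frame importance scores compared with ground-truth scores to drive training) can be sketched as follows. This is only a schematic: a linear model trained with gradient descent on an MSE loss stands in for the paper's conv-LSTM encoder-decoder, and all data shapes and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 50, 16                        # frames and feature dimension (assumed)
X = rng.normal(size=(T, D))          # per-frame deep features (stand-in)
y = rng.uniform(size=T)              # ground-truth importance scores in [0, 1]

# Linear importance predictor trained by plain gradient descent on the
# mean-squared error between predicted and ground-truth scores.
w = np.zeros(D)
lr = 0.05
for _ in range(500):
    pred = X @ w
    grad = 2.0 * X.T @ (pred - y) / T
    w -= lr * grad

mse = np.mean((X @ w - y) ** 2)
print(f"final MSE: {mse:.4f}")
```

The comparison-and-update loop is the essential point; in the cited work the predictor is a recurrent spatiotemporal network rather than a linear map, but the training signal has the same form.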
“…Alternatively, they are extended by memory networks to increase the memorization capacity of the network and capture long-range temporal dependencies among parts of the video ([35], [36]). Going one step further, a group of techniques tries to learn importance by paying attention to both the spatial and temporal structure of the video, using convolutional LSTMs [37], [38], optical flow maps [39], combinations of CNNs and RNNs [40], or motion extraction mechanisms [42]. Following a different approach, a couple of supervised methods learn how to generate video summaries that are aligned with human preferences with the help of GANs ([43], [44]).…”
Section: E. General Remarks on Deep Learning Approaches (mentioning)
confidence: 99%