Use of Affective Visual Information for Summarization of Human-Centric Videos

Köprü, Berkay; Erzin, Engin

doi:10.48550/arxiv.2107.03783

Cited by 1 publication (2 citation statements)

References 40 publications (86 reference statements)


“…Ji et al. [16] solve the problem of short-term contextual attention insufficiency and distribution inconsistency. Köprü [17] proposes two new architectures based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM).…”

Section: Related Work (mentioning)

confidence: 99%


A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

Teng, Gui, et al. 2022. Sensors


Video summarization (VS) is a widely used technique for facilitating effective reading, fast comprehension, and effective retrieval of video content. Certain properties of new video data, such as a lack of prominent emphasis and fuzzy boundaries in theme development, disturb the original thinking mode based on video feature information. They also introduce new challenges for extracting depth and breadth features from video. In addition, the diversity of user requirements further complicates accurate keyframe screening. To overcome these challenges, this paper proposes a hierarchical spatial–temporal cross-attention scheme for video summarization based on contrastive learning. Graph attention networks (GAT) and a multi-head convolutional attention cell are used to extract local and depth features, while a GAT-adjusted bidirectional ConvLSTM (DB-ConvLSTM) is used to extract global and breadth features. Furthermore, a spatial–temporal cross-attention-based ConvLSTM is developed to merge hierarchical characteristics and achieve more accurate screening within clusters of similar keyframes. Verification experiments and comparative analysis demonstrate that our method outperforms state-of-the-art methods.


“…$L_{cro}$ is the loss function of cross-attention. Both $L_{mat}$ and $L_{dat}$ are cross-entropy, as defined by Equation (17). To resolve the centralization issue and reduce the ambiguity problem in key frame filtering, we use $L_{cen}$ for centralization keyframe scores:…”

Section: Loss Function (mentioning)

confidence: 99%

