Multimodal Saliency Models for Videos

Coutrot, Antoine; Guyader, Nathalie

doi:10.1007/978-1-4939-3435-5_16

Cited by 23 publications

(11 citation statements)

References 58 publications


“…A. Setup 1) Datasets: The proposed method is trained and evaluated on AVAD [53], Coutrot1 [68], [69], Coutrot2 [68], [69], DIEM [70], ETMD [71], [72] and SumMe [72], [73] datasets. These datasets contain various types of videos accompanied by audio.…”

Section: Experiments and Results (mentioning)

confidence: 99%

“…The dataset also contains the eye-tracking data from 16 participants. 2) The Coutrot1 and Coutrot2 datasets are split from the Coutrot dataset [68], [69]. The Coutrot1 dataset is with 60 video clips covering 4 visual categories: one moving object, several moving objects, landscapes, and faces.…”

Section: Experiments and Results (mentioning)

confidence: 99%

“…On the one hand, we compare the proposed method with 8 state-of-the-art VAP methods qualitatively on six benchmark datasets with audio-visual eye-tracking data, including AVAD [53], Coutrot1 [68], [69], Coutrot2 [68], [69], DIEM [70], ETMD [71], [72] and SumMe [72], [73] datasets. TABLE II reports the qualitative results.…”

Section: Comparison With State-of-the-arts (mentioning)

confidence: 99%


Bio-Inspired Representation Learning for Visual Attention Prediction

Yuan, Ning (2021), IEEE Trans. Cybern.


Visual Attention Prediction (VAP) is a significant and imperative issue in the field of computer vision. Most existing VAP methods are based on deep learning. However, they do not fully take advantage of low-level contrast features when generating the visual attention map. In this paper, a novel VAP method is proposed to generate the visual attention map via bio-inspired representation learning. The bio-inspired representation learning combines low-level contrast and high-level semantic features simultaneously, motivated by the fact that the human eye is sensitive to patches with high contrast and objects with high semantics. The proposed method is composed of three main steps: 1) feature extraction, 2) bio-inspired representation learning and 3) visual attention map generation. First, the high-level semantic feature is extracted from the refined VGG16, while the low-level contrast feature is extracted by the proposed contrast feature extraction block in a deep network. Second, during bio-inspired representation learning, the extracted low-level contrast and high-level semantic features are combined by the designed densely connected block, which concatenates various features scale by scale. Finally, the weighted-fusion layer is exploited to generate the final visual attention map based on the representations obtained after bio-inspired representation learning. Extensive experiments are performed to demonstrate the effectiveness of the proposed method.


“…At the same time, we evaluate our model on six audio-video saliency datasets: DIEM [30], Coutrot1 [11], [12], Coutrot2 [11], [12], AVAD [29], ETMD [21], SumMe [16].…”

Section: Datasets (mentioning)

confidence: 99%

Temporal-Spatial Feature Pyramid for Video Saliency Detection

Chang, Zhu (2021), Preprint


In this paper, we propose a 3D fully convolutional encoder-decoder architecture for video saliency detection, which combines scale, space and time information for video saliency modeling. The encoder extracts multi-scale temporal-spatial features from the input continuous video frames, and then constructs a temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. The decoder performs hierarchical decoding of temporal-spatial features from different scales, and finally produces a saliency map from the integration of multiple video frames. Our model is simple yet effective, and can run in real time. We perform abundant experiments, and the results indicate that the well-designed structure can improve the precision of video saliency detection significantly. Experimental results on three purely visual video saliency benchmarks and six audio-video saliency benchmarks demonstrate that our method achieves state-of-the-art performance.


“…Of more consequence is the lack of a model for computation of audiovisual saliency in complex video sequences. Existing literature for audio-video saliency modeling is scarce and often targets a specific class of videos [10], [27], [28]. Therefore, an extended saliency model to predict salient regions in complex videos with different sound classes is required.…”

Section: Introduction (mentioning)

confidence: 99%

Audiovisual Saliency Prediction in Uncategorized Video Sequences based on Audio-Video Correlation

Butt, Rahman (2021), Preprint


Substantial research has been done in saliency modeling to develop intelligent machines that can perceive and interpret their surroundings. But existing models treat videos as mere image sequences, excluding any audio information, and are unable to cope with inherently varying content. Based on the hypothesis that an audiovisual saliency model will be an improvement over traditional saliency models for natural uncategorized videos, this work aims to provide a generic audio/video saliency model augmenting a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features. The proposed model was evaluated using different criteria against eye fixation data for the publicly available DIEM video dataset. The results show that the model outperformed two state-of-the-art visual saliency models.
