Hierarchical Multimodal Attention for Deep Video Summarization

Sanabria, Melissa; Precioso, Fŕed́eric; Menguy, Thomas

doi:10.1109/icpr48806.2021.9413097

Cited by 16 publications

(9 citation statements)

References 46 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Fig. 4 illustrates the concept of our fusion mechanism, inspired by the method presented in [17], that fuses the information from both modalities in a hierarchical fashion. First, we feed each video (u V t ) T t=1 and audio (u A t ) T t=1 embedded sequences to separated 128-dim BiGRU layers with hidden state h {V,A} t…”

Section: Cross-modality Fusionmentioning

confidence: 99%

“…In [16,17], the λ weights are learnt using perceptrons that fully-connect, at any given time, the hidden representation of each modality. This paper presents a novel approach to compute the weights of the modalities using: (1) an estimation of the uncertainty of both video and audio embedded representations, and (2) self-attention to measure the importance of video and audio modalities in their local temporal context.…”

Section: Cross-modality Fusionmentioning

confidence: 99%

“…Figures 5b and 5d compare the performance of our fusion method with the Naïve and Hierarchical [17] Fusion schemes in different scenarios. Our multi-objective model is first evaluated using a naïve concatenation fusion (NF) to combine video and audio modalities.…”

Section: Performance Breakdownmentioning

confidence: 99%

“…This result highlights the effectiveness of our end-to-end multi-objective learning. Hierarchical Fusion (HF) refers the fusion scheme presented in [17]. Note that our adaptation implies a slight modification of the initial method to match our "many-to-many" architecture.…”

Section: Performance Breakdownmentioning

confidence: 99%

“…Hore et al [16] proposed a model that selectively uses features from different modalities. This approach was later improved using hierarchical attention fusion in [17]. Despite the growing interest of these fusion mechanisms in other disciplines, to the best of our knowledge, no studies have been conducted on ASD problem.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion

Pouthier¹,

Pilati²,

Gudupudi³

et al. 2021

Interspeech 2021

Self Cite

View full text Add to dashboard Cite

It is now well established from a variety of studies that there is a significant benefit from combining video and audio data in detecting active speakers. However, either of the modalities can potentially mislead audiovisual fusion by inducing unreliable or deceptive information. This paper outlines active speaker detection as a multi-objective learning problem to leverage best of each modalities using a novel self-attention, uncertaintybased multimodal fusion scheme. Results obtained show that the proposed multi-objective learning architecture outperforms traditional approaches in improving both mAP and AUC scores. We further demonstrate that our fusion strategy surpasses, in active speaker detection, other modality fusion methods reported in various disciplines. We finally show that the proposed method significantly improves the state-of-the-art on the AVA-ActiveSpeaker dataset.

show abstract

Section: Cross-modality Fusionmentioning

confidence: 99%

Section: Cross-modality Fusionmentioning

confidence: 99%

Section: Performance Breakdownmentioning

confidence: 99%

Section: Performance Breakdownmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion

Pouthier¹,

Pilati²,

Gudupudi³

et al. 2021

Interspeech 2021

Self Cite

View full text Add to dashboard Cite

show abstract

A topic modeling‐based bibliometric exploration of automatic summarization research

Chen,

Xie,

Tao

et al. 2024

WIREs Data Min & Knowl

View full text Add to dashboard Cite

The surge in text data has driven extensive research into developing diverse automatic summarization approaches to effectively handle vast textual information. There are several reviews on this topic, yet no large‐scale analysis based on quantitative approaches has been conducted. To provide a comprehensive overview of the field, this study conducted a bibliometric analysis of 3108 papers published from 2010 to 2022, focusing on automatic summarization research regarding topics and trends, top sources, countries/regions, institutions, researchers, and scientific collaborations. We have identified the following trends. First, the number of papers has experienced 65% growth, with the majority being published in computer science conferences. Second, Asian countries and institutions, notably China and India, actively engage in this field and demonstrate a strong inclination toward inter‐regional international collaboration, contributing to more than 24% and 20% of the output, respectively. Third, researchers show a high level of interest in multihead and attention mechanisms, graph‐based semantic analysis, and topic modeling and clustering techniques, with each topic having a prevalence of over 10%. Finally, scholars have been increasingly interested in self‐supervised and zero/few‐shot learning, multihead and attention mechanisms, and temporal analysis and event detection. This study is valuable when it comes to enhancing scholars' and practitioners' understanding of the current hotspots and future directions in automatic summarization.This article is categorized under: Algorithmic Development > Text Mining

show abstract