AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

Vo-Ho, Viet-Khoa; Truong, Sang; Yamazaki, Kashu; Raj, Bhiksha; Tran, Minh-Triet; Le, Ngan

doi:10.1007/s11263-022-01702-9

Cited by 19 publications

(12 citation statements)

References 51 publications

“…On NMS, AOE-Net obtains the best AR@100 and the second-best AR@200 and AR@500, with a very small gap to the SOTA (57.49 vs. 57.74 and 62.40 vs. 62.74, respectively). Notably, the TAPG performance of our AOE-Net on both datasets is very competitive with AEI-B [52] and is followed closely by ABN [6], both of which also incorporate local actors and global environment. This experiment strongly supports our observation and motivation on using the human perception principle to analyze human actions in untrimmed videos.…”

Section: Methods (mentioning)

confidence: 85%

AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

Vo, Truong, Yamazaki et al. 2022

Preprint

as its input and generates action proposals. Comprehensive experiments and extensive ablation studies on the ActivityNet-1.3 and THUMOS-14 datasets show that our proposed AOE-Net outperforms previous state-of-the-art methods with remarkable performance and generalization for both TAPG and temporal action detection. To demonstrate the robustness and effectiveness of AOE-Net, we further conduct an ablation study on egocentric videos, i.e., the EPIC-KITCHENS 100 dataset. Our source code is publicly available at https://github.com/UARK-AICV/AOE-Net.

Section: Methods (mentioning)

confidence: 85%

“…In our AOE-Net, BMM module is adopted from previous works i.e. BSN [1], BMN [3], ABN [7], AEN [6], AEI [52] because of its standard and simple design. BMM takes the output V-L features sequence $F = \{f_i\}_{i=1}^{T}$ from PMR module as its input.…”

Section: Boundary-matching Module (BMM) (mentioning)

confidence: 99%

AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

Vo, Truong, Yamazaki et al. 2022

Preprint

“…Top k similarity scores are chosen as language-based frame embedding feature $F^{l}_{i}$. (iii) -language feature extraction: In this step, we employ Adaptive Attention Mechanism (AAM) [3] to select the most relevant representative language features:…”

Section: Proposed Method (mentioning)

confidence: 99%
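
The top-k selection mentioned in the statement above can be illustrated with a minimal sketch. This is not the citing paper's implementation: the use of cosine similarity, the function name `top_k_text_features`, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_text_features(frame_emb, text_embs, k):
    """Return indices of the k text embeddings most similar to the frame embedding.

    Hypothetical helper: ties are broken by the lower index.
    """
    scored = [(cosine(frame_emb, t), i) for i, t in enumerate(text_embs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
    return [i for _, i in scored[:k]]

# Toy data (assumed): one frame embedding and a four-entry text vocabulary.
frame = [1.0, 0.0]
vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
selected = top_k_text_features(frame, vocab, k=2)
print(selected)
```

With these toy vectors the two entries pointing most nearly in the frame's direction are kept; in a real pipeline the embeddings would come from a pretrained vision-language encoder.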

VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

Yamazaki, Truong, Vo-Ho et al. 2022

2022 IEEE International Conference on Image Processing (ICIP)

In this paper, we leverage the human perception process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities, i.e., (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and both visual and non-visual elements (e.g., relations, activities). Furthermore, we propose to train VLCap under a contrastive learning VL loss. The experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that our VLCap outperforms existing SOTA methods on both accuracy and diversity metrics. Source code: https://github.com/UARK-AICV/VLCAP

“…Strong supervision gives precise boundary labels and category labels for training. There are two detailed pipelines: the top-down framework [65,63,12,7,35,73,67,98,70,75] pre-defines extensive anchors, adopts fixed-length sliding windows to produce initial proposals, then regresses to refine boundaries; the bottom-up framework [92,36,34,68,90,1] learns frame-wise boundary detectors for the boundary frames, then groups extreme frames or estimates action lengths for proposal generation. In addition, several works [10,39,78] used various fusion strategies to complement these frameworks.…”

Section: Related Work (mentioning)

confidence: 99%
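
The first stage of the top-down framework described in the statement above (pre-defined sliding-window anchors as initial proposals) can be sketched as follows. This is an illustrative sketch only, not any cited method: the function names, the window lengths, and the 50% stride ratio are assumptions, and the boundary-regression stage is omitted. A temporal IoU helper is included because anchors are usually matched to ground-truth segments by overlap.

```python
def sliding_window_anchors(num_frames, window_lengths, stride_ratio=0.5):
    """Enumerate (start, end) frame windows for each pre-defined length.

    Assumption: the stride is a fixed fraction of the window length.
    """
    anchors = []
    for length in window_lengths:
        stride = max(1, int(length * stride_ratio))
        for start in range(0, num_frames - length + 1, stride):
            anchors.append((start, start + length))
    return anchors

def temporal_iou(a, b):
    """Intersection over union of two temporal segments given as (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Toy example (assumed values): a 16-frame video with window lengths 4 and 8.
anchors = sliding_window_anchors(num_frames=16, window_lengths=[4, 8])
ground_truth = (2, 6)
best = max(anchors, key=lambda a: temporal_iou(a, ground_truth))
print(len(anchors), best)
```

A real top-down detector would then regress start and end offsets for each anchor; a bottom-up method would instead predict per-frame boundary probabilities and group them into proposals.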

Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization

Chen, Zhao, Chen et al. 2023

IEEE Trans. Multimedia

In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even those not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to compensate image-text foundation models with temporal motions, we improve category-agnostic action proposal by explicitly aligning embeddings of optical flows, RGB and texts, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., classifiers that avoid lexical ambiguities. To be specific, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models), or with visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
