This paper proposes the Weakly-supervised Moment Retrieval Network (WMRN) for Video Corpus Moment Retrieval (VCMR), which retrieves the temporal moments relevant to a natural language query from a large video corpus. Previous VCMR methods require full supervision with temporal boundary information during training, which involves the labor-intensive process of annotating boundaries in a large number of videos. To alleviate this, the proposed WMRN performs VCMR in a weakly-supervised manner: it is trained without ground-truth boundary labels, using only paired videos and text queries. For weakly-supervised VCMR, WMRN addresses two limitations of prior methods: (1) blurry attention over video features caused by generating redundant video candidate proposals, and (2) insufficient learning caused by weak supervision from video-query pairs alone. To this end, WMRN is built on (1) Text Guided Proposal Generation (TGPG), which generates text-guided multi-scale video proposals in the prospective region related to the query, and (2) Hard Negative Proposal Sampling (HNPS), which enhances video-language alignment by extracting negative video proposals from the positive video sample for contrastive learning. Experimental results show that WMRN achieves state-of-the-art performance on the TVR and DiDeMo benchmarks in the weakly-supervised setting. Comprehensive ablation studies and qualitative analysis validate the contribution of each proposed component.
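The idea of drawing hard negatives from the positive video can be illustrated with a minimal sketch. This is not WMRN's implementation: the abstract does not specify how proposals are scored or how negatives are chosen, so the function names, the temporal-IoU threshold, the temperature, and the InfoNCE-style loss are all assumptions made for illustration.

```python
import math


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def multi_scale_proposals(center, scales, duration):
    """Hypothetical helper: windows of several lengths centred on one time
    point, clipped to the video duration."""
    return [
        (max(0.0, center - s / 2), min(duration, center + s / 2))
        for s in scales
    ]


def hard_negative_contrastive_loss(proposals, scores,
                                   iou_threshold=0.3, temperature=0.1):
    """InfoNCE-style loss over proposals taken from ONE (positive) video.

    Assumption: the highest-scoring proposal is treated as the positive, and
    proposals from the same video that overlap it by less than
    `iou_threshold` are treated as hard negatives.
    """
    pos = max(range(len(scores)), key=scores.__getitem__)
    negatives = [
        i for i in range(len(proposals))
        if i != pos and temporal_iou(proposals[i], proposals[pos]) < iou_threshold
    ]
    logits = [scores[pos] / temperature]
    logits += [scores[i] / temperature for i in negatives]
    # Numerically stable log-sum-exp over the positive and its negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0], pos, negatives


if __name__ == "__main__":
    props = [(0.0, 10.0), (2.0, 8.0), (20.0, 30.0), (40.0, 50.0)]
    sims = [0.9, 0.8, 0.7, 0.1]  # made-up query-proposal similarities
    loss, pos, negs = hard_negative_contrastive_loss(props, sims)
    print(pos, negs, round(loss, 4))
```

In this toy input the second proposal overlaps the positive heavily, so it is excluded from the negatives; only the two distant proposals in the same video are contrasted against the positive.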
Grant funded by the Korea Government through MSIT (Development of causal AI through video understanding and development and study of AI technologies to inexpensively conform to evolving policy on ethics) under Grant 2021-0-01381 and Grant 2022-0-00184.
ABSTRACT 3D human pose and shape estimation (3D-HPSE) from video aims to generate a sequence of 3D meshes that depict the human body in the video. Current deep-learning-based 3D-HPSE networks that take video input have focused on improving temporal consistency across the sequence of 3D joints by supervising the acceleration error between predicted and ground-truth human motion. However, these methods overlook geometric misalignments, that is, persistent discrepancies between the geometric path drawn by the sequence of predicted joints and the path drawn by the ground-truth joints. To this end, we propose the Joint Path Alignment (JPA) framework, a model-agnostic approach that mitigates geometric misalignments by introducing a Temporal Procrustes Alignment Regularization (TPAR) loss that performs group-wise sequence learning of joint movement paths. Unlike previous methods, which rely solely on per-frame supervision for accuracy, our framework adds sequence-level accuracy supervision through the TPAR loss by performing Procrustes analysis on the geometric paths drawn by sequences of predicted joints. Our experiments show that the JPA framework enables the network to exceed previous state-of-the-art performance on benchmark datasets in both per-frame accuracy and the video smoothness metric.
INDEX TERMS 3D human pose and shape estimation from video, temporal alignment, Procrustes analysis.
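The acceleration error mentioned in this abstract can be sketched as follows. The abstract does not define it, so the code assumes a commonly used form: the mean Euclidean distance between second-order finite differences of the predicted and ground-truth joint positions. The function names and the nested-list data layout are hypothetical, and the TPAR loss itself is not reproduced here.

```python
import math


def acceleration(seq):
    """Second-order finite difference x[t-1] - 2*x[t] + x[t+1] per joint.

    seq[t][j] is an (x, y, z) tuple for joint j at frame t.
    """
    return [
        [
            tuple(p - 2 * c + n for p, c, n in zip(prev_j, cur_j, next_j))
            for prev_j, cur_j, next_j in zip(seq[t - 1], seq[t], seq[t + 1])
        ]
        for t in range(1, len(seq) - 1)
    ]


def acceleration_error(pred, gt):
    """Mean distance between predicted and ground-truth joint accelerations."""
    if len(pred) != len(gt) or len(pred) < 3:
        raise ValueError("need two sequences of equal length >= 3 frames")
    norms = [
        math.dist(a, b)
        for frame_p, frame_g in zip(acceleration(pred), acceleration(gt))
        for a, b in zip(frame_p, frame_g)
    ]
    return sum(norms) / len(norms)


if __name__ == "__main__":
    # One joint moving at constant velocity along x (zero acceleration).
    gt = [[(float(t), 0.0, 0.0)] for t in range(4)]
    # Same path with a one-frame jitter in y.
    jitter = [[(0.0, 0.0, 0.0)], [(1.0, 1.0, 0.0)],
              [(2.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
    # Same path shifted by a constant offset in y.
    offset = [[(float(t), 5.0, 0.0)] for t in range(4)]
    print(acceleration_error(jitter, gt))
    print(acceleration_error(offset, gt))
```

Under this definition a constant offset between the predicted and ground-truth paths yields zero acceleration error, because second differences cancel any constant term. This is consistent with the abstract's point that acceleration supervision alone does not capture persistent path discrepancies.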
“A picture is worth a thousand words.” Given an image, humans can deduce various cause-and-effect captions describing past, current, and future events beyond the image. The task of visual commonsense generation aims to generate three cause-and-effect captions for a given image: (1) what needed to happen before, (2) what the current intent is, and (3) what will happen after. However, this task is challenging for machines because existing approaches have two limitations: they (1) directly use conventional vision–language transformers to learn relationships between input modalities and (2) ignore relations among the target cause-and-effect captions, considering each caption independently. Herein, we propose Cause-and-Effect BART (CE-BART), which is based on (1) a structured graph reasoner that captures intra- and inter-modality relationships among visual and textual representations and (2) a cause-and-effect generator that generates cause-and-effect captions by considering the causal relations among inferences. We demonstrate the validity of CE-BART on the VisualCOMET and AVSD benchmarks. CE-BART achieved state-of-the-art performance on both benchmarks, and an extensive ablation study and qualitative analysis demonstrated the performance gain and improved interpretability.