“A picture is worth a thousand words.” Given an image, humans are able to deduce various cause-and-effect captions of past, current, and future events beyond the image. The task of visual commonsense generation aims to generate three cause-and-effect captions for a given image: (1) what needed to happen before, (2) what the current intent is, and (3) what will happen after. However, this task is challenging for machines owing to two limitations: existing approaches (1) directly utilize conventional vision-language transformers to learn relationships between input modalities, and (2) ignore relations among the target cause-and-effect captions, considering each caption independently. We propose Cause-and-Effect BART (CE-BART), which is based on (1) a Structured Graph Reasoner that captures intra- and inter-modality relationships among visual and textual representations, and (2) a Cause-and-Effect Generator that generates cause-and-effect captions by considering the causal relations among inferences. We demonstrate the validity of CE-BART on the VisualCOMET and AVSD benchmarks. CE-BART achieves SOTA performance on both benchmarks, while an extensive ablation study and qualitative analysis demonstrate the performance gain and improved interpretability.
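The abstract does not specify the decoding procedure, but the key idea of the Cause-and-Effect Generator is that the three captions are not produced independently. A minimal sketch of that chaining, assuming an off-the-shelf BART model and a text-only stand-in for the fused visual context (the actual CE-BART conditions on graph-reasoned multimodal features), might look as follows:

```python
# Minimal sketch (not the authors' code): generate "before", "intent", and "after"
# captions sequentially, feeding earlier inferences back in as context so the
# three captions are causally linked instead of being produced independently.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def generate_caption(context: str) -> str:
    inputs = tokenizer(context, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical text stand-in for the image; CE-BART would use visual features here.
event = "A man is kneeling beside a car with a flat tire."
before = generate_caption(f"{event} Before, the person needed to")
intent = generate_caption(f"{event} Before: {before}. Currently, the person wants to")
after = generate_caption(f"{event} Before: {before}. Intent: {intent}. After, the person will")
```

A pretrained, non-fine-tuned BART will not emit meaningful commonsense inferences; the sketch only illustrates how earlier inferences can be fed back as context for later ones.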
This paper considers a Deep Convolutional Neural Network (DCNN) with an attention mechanism, referred to as Dual-Scale Doppler Attention (DSDA), for human identification given a micro-Doppler (MD) signature as input. The MD signature encodes unique gait characteristics induced by body parts of different sizes moving at different speeds: the arms and legs move rapidly, while the torso moves slowly. Each person is identified based on his/her unique gait characteristic in the MD signature. DSDA provides attention at different time-frequency resolutions to cater to MD components comprising both fast-varying and steady parts. Through this, DSDA can capture each person's unique gait characteristic for human identification. We demonstrate the validity of DSDA on a recently published benchmark dataset, IDRad. The empirical results show that the proposed DSDA outperforms previous methods, while a qualitative analysis demonstrates its interpretability on MD signatures.
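As a rough illustration of attending over two time-frequency resolutions (a sketch under assumed window sizes and layer shapes, not the published DSDA architecture), fast-varying limb components favor a short STFT window while the slowly varying torso component favors a long one, and a learned attention can weight the two scales:

```python
# Toy dual-scale attention over micro-Doppler spectrograms (assumed design, not DSDA itself).
import torch
import torch.nn as nn

class DualScaleAttention(nn.Module):
    def __init__(self, n_fft_fine=64, n_fft_coarse=512, embed_dim=128, n_people=5):
        super().__init__()
        self.n_fft_fine, self.n_fft_coarse = n_fft_fine, n_fft_coarse
        self.embed_fine = nn.Linear(n_fft_fine // 2 + 1, embed_dim)
        self.embed_coarse = nn.Linear(n_fft_coarse // 2 + 1, embed_dim)
        self.attn = nn.Linear(embed_dim, 1)          # scalar attention score per scale
        self.classifier = nn.Linear(embed_dim, n_people)

    def branch(self, x, n_fft, proj):
        spec = torch.stft(x, n_fft=n_fft, hop_length=n_fft // 2,
                          window=torch.hann_window(n_fft), return_complex=True).abs()
        return proj(spec.mean(dim=-1))               # pool over time -> (batch, embed_dim)

    def forward(self, x):                            # x: (batch, samples) raw radar return
        fine = self.branch(x, self.n_fft_fine, self.embed_fine)        # fine time resolution
        coarse = self.branch(x, self.n_fft_coarse, self.embed_coarse)  # fine frequency resolution
        scales = torch.stack([fine, coarse], dim=1)                    # (batch, 2, embed_dim)
        weights = torch.softmax(self.attn(scales), dim=1)              # attend over the two scales
        return self.classifier((weights * scales).sum(dim=1))

logits = DualScaleAttention()(torch.randn(4, 8192))  # 4 dummy radar segments
```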
Video corpus moment retrieval aims to localize temporal moments corresponding to a textual query in a large video corpus. Previous moment retrieval systems are largely grouped into two categories: (1) anchor-based methods, which preset a set of video segment proposals (via a sliding window) and predict the proposal that best matches the query, and (2) anchor-free methods, which directly predict the frame-level start and end times of the moment related to the query (via regression). Both methods have their own inherent weaknesses: (1) anchor-based methods are vulnerable to the heuristic rules used to generate video proposals, which restricts moment predictions of varying lengths; and (2) anchor-free methods, being based on frame-level interplay, suffer from an insufficient understanding of contextual semantics in long, sequential videos. To overcome the aforementioned challenges, our proposed Cascaded Moment Proposal Network incorporates the following two main properties: (1) Hierarchical Semantic Reasoning, which provides video understanding from the anchor-free level to the anchor-based level by building a hierarchical video graph, and (2) Cascaded Moment Proposal Generation, which performs precise moment retrieval by devising cascaded multi-modal feature interaction between anchor-free and anchor-based video semantics. Extensive experiments show state-of-the-art performance on three moment retrieval benchmarks (TVR, ActivityNet, DiDeMo), while qualitative analysis shows improved interpretability. The code will be made publicly available.
INDEX TERMS Video corpus moment retrieval, cascaded moment proposal, multi-modal interaction, vision-language system.
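To make the anchor-based versus anchor-free distinction concrete, here is a minimal sketch (hypothetical layer names and a toy query-video interaction, not the paper's network): an anchor-free head scores start and end positions per frame, an anchor-based head scores sliding-window proposals, and a simple cascade adds the frame-level cues to each proposal score:

```python
# Toy contrast of anchor-free and anchor-based moment heads with a naive cascade
# (illustrative assumptions only; the actual Cascaded Moment Proposal Network differs).
import torch
import torch.nn as nn

class MomentHeads(nn.Module):
    def __init__(self, dim=256, window_sizes=(8, 16, 32)):
        super().__init__()
        self.window_sizes = window_sizes
        self.start_head = nn.Linear(dim, 1)   # anchor-free: per-frame start logit
        self.end_head = nn.Linear(dim, 1)     # anchor-free: per-frame end logit
        self.prop_head = nn.Linear(dim, 1)    # anchor-based: per-proposal logit

    def forward(self, frames, query):                     # frames: (T, dim), query: (dim,)
        fused = frames * query                            # toy query-video interaction
        start = self.start_head(fused).squeeze(-1)        # (T,)
        end = self.end_head(fused).squeeze(-1)            # (T,)
        proposals, scores = [], []
        T = frames.size(0)
        for w in self.window_sizes:                       # sliding-window anchors
            for s in range(0, T - w + 1, w // 2):
                e = s + w - 1
                seg = fused[s:e + 1].mean(dim=0)          # pooled proposal feature
                proposals.append((s, e))
                scores.append(self.prop_head(seg).squeeze()
                              + start[s] + end[e])        # cascade: add frame-level cues
        scores = torch.stack(scores)
        best = int(scores.argmax())
        return proposals[best], scores[best]

span, score = MomentHeads()(torch.randn(64, 256), torch.randn(256))
```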
Grant funded by the Korea Government through MSIT (Development of causal AI through video understanding and development and study of AI technologies to inexpensively conform to evolving policy on ethics) under Grant 2021-0-01381 and Grant 2022-0-00184.
ABSTRACT 3D human pose and shape estimation (3D-HPSE) from video aims to generate a sequence of 3D meshes that depict the human body in the video. Current deep-learning-based 3D-HPSE networks that take video input have focused on improving temporal consistency among the sequence of 3D joints by supervising the acceleration error between predicted and ground-truth human motion. However, these methods overlook the persistent geometric misalignment between the paths drawn by sequences of predicted joints and those drawn by the ground-truth joints. To this end, we propose the Joint Path Alignment (JPA) framework, a model-agnostic approach that mitigates geometric misalignment by introducing the Temporal Procrustes Alignment Regularization (TPAR) loss, which performs group-wise sequence learning of joint movement paths. Unlike previous methods that rely solely on per-frame supervision for accuracy, our framework adds sequence-level accuracy supervision with the TPAR loss by performing Procrustes analysis on the geometric paths drawn by sequences of predicted joints. Our experiments show that the JPA framework advances the network beyond the previous state-of-the-art performance on benchmark datasets in both per-frame accuracy and video smoothness metrics.
INDEX TERMS 3D human pose and shape estimation from video, temporal alignment, Procrustes analysis.
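The abstract leaves the loss unspecified beyond its reliance on Procrustes analysis; a minimal per-joint sketch of the idea (an assumption, not the authors' TPAR implementation, and written with NumPy rather than a differentiable framework) aligns the predicted joint path to the ground-truth path with translation, rotation, and uniform scale before penalizing the residual:

```python
# Sketch of a Procrustes-based path error between a predicted and a ground-truth
# joint trajectory (illustrative assumption of the TPAR idea, per joint only).
import numpy as np

def procrustes_path_error(pred, gt):
    """pred, gt: (T, 3) positions of one joint over T frames."""
    P, G = pred - pred.mean(axis=0), gt - gt.mean(axis=0)   # remove translation
    U, S, Vt = np.linalg.svd(G.T @ P)                        # orthogonal Procrustes
    D = np.eye(3)
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))               # avoid reflections
    R = U @ D @ Vt                                           # optimal rotation
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()          # optimal uniform scale
    aligned = scale * P @ R.T                                # predicted path after alignment
    return np.mean(np.linalg.norm(aligned - G, axis=1))      # residual path error

T = 50
gt_path = np.cumsum(np.random.randn(T, 3) * 0.01, axis=0)   # dummy ground-truth joint path
pred_path = gt_path + np.random.randn(T, 3) * 0.005         # noisy prediction
print(procrustes_path_error(pred_path, gt_path))
```

A training implementation would need the same computation in a differentiable framework and, per the abstract, group-wise alignment over sets of joint paths rather than one joint at a time.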