Weakly-Supervised Moment Retrieval Network for Video Corpus Moment Retrieval

Yoon, Sunjae; Kim, Dahyun; Hong, Ji Woo; Kim, Junyeong; Kim, Kookhoi; Yoo, Chang D.

doi:10.1109/icip42928.2021.9506218

Cited by 4 publications

(3 citation statements)

References 14 publications

“…The green curve denotes the L ar with L rub and it shows further optimizing compared to without L rub as the L g(e) ar decreases. This denotes that neural networks can be further optimized according to the training epochs by calibrating their training objectives, which is also validated in other multi-modal systems (Yoon et al., 2023; Zheng et al., 2022) in other ways.…”

Section: Ablation Study (mentioning)

confidence: 76%

HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue

Yoon, Kim, Yoon et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023

Video-grounded Dialogue (VGD) aims to answer questions regarding a given multi-modal input comprising video, audio, and dialogue history. Although there have been numerous efforts in developing VGD systems to improve the quality of their responses, existing systems are competent only to incorporate the information in the video and text and tend to struggle in extracting the necessary information from the audio when generating appropriate responses to the question. The VGD system seems to be deaf, and thus, we coin this symptom of current systems' ignoring audio data as a deaf response. To overcome the deaf response problem, Hearing Enhanced Audio Response (HEAR) framework is proposed to perform sensible listening by selectively attending to audio whenever the question requires it. The HEAR framework enhances the accuracy and audibility of VGD systems in a model-agnostic manner. HEAR is validated on VGD datasets (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows effectiveness with various VGD systems.

“…Thus our future work is to build a dataset to perform a more general format of RHL tasks by building real environmental data under more diverse conditions such as the co-occurrence of human and outdoor environments. Furthermore, we also consider extending the work of the current training framework of CLNet to be performed in weakly-supervised settings [26], [27], which mitigates the reliance on temporal annotations to train localization in MD signatures.…”

Section: Limitation (mentioning)

confidence: 99%

Causal Localization Network for Radar Human Localization With Micro-Doppler Signature

Yoon, Koo, Shim et al. 2024

IEEE Access

Self Cite

Micro-Doppler (MD) signature includes unique characteristics by different-sized body parts such as arms, legs, and torso. Existing radar identification systems have made an effort to classify the identification of humans using these characteristics presented in MD signatures while achieving a remarkable performance of classification. However, we argue that the radar identification system also should be extended to perform more fine-grained tasks to achieve the flexibility of the identification. In this paper, we introduce a radar human localization (RHL) task, which involves temporally localizing human identifications within untrimmed MD signatures. To enable RHL, we have constructed a micro-Doppler dataset referred to as IDRad-TBA. Furthermore, we propose Causal Localization Network (CLNet) as the RHL baseline system built upon the IDRad-TBA dataset. CLNet employs a novel temporal causal prediction approach for MD signature localization. Experimental results validate the effectiveness of CLNet in performing the RHL task. Our project is available at: https://github.com/dbstjswo505/CLNet

Index Terms: deep learning, temporal human identification, micro-Doppler radar, information retrieval.

“…For the video encoder in grounding model, we follow previous methods [10], [61] to utilize I3D [67] for C-STA and C3D [68] for ANC. The features are extracted by downsampling each video at a rate of 8, and the maximum video segments is set as 200.…”

Section: B Implementation Details (mentioning)

confidence: 99%

Multi-Hierarchical Category Supervision for Weakly-Supervised Temporal Action Localization

Wang et al. 2021

IEEE Trans. on Image Process.

Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., generating pseudo-temporal boundaries for training, have achieved great success. However, data augmentations in these methods might disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), first use the additional information to expand the initial incomplete pseudo boundaries, and subsequently refine these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within initial pseudo boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise of expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.
