2021
DOI: 10.48550/arxiv.2104.09829
Preprint

Detector-Free Weakly Supervised Grounding by Separation

Cited by 2 publications (5 citation statements)
References: 0 publications

“…As one of the multi-modal tasks (Zhang et al. 2019b; Arbelle et al. 2021; Zhang et al. 2021; Wang et al. 2018, 2019), temporal sentence grounding is the task of temporally localizing the segment of a video that corresponds to a given sentence query, initially proposed by (Gao et al. 2017; Anne Hendricks et al. 2017). It remains challenging because it requires understanding the semantics of both vision and language and aligning the two modalities.…”
Section: Related Work (Fully Supervised Temporal Grounding)
Citation type: mentioning
Confidence: 99%
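
For readers unfamiliar with the task, the statement above can be made concrete with a toy sketch: temporal sentence grounding reduces to scoring candidate video segments against a sentence embedding and returning the best match. Everything below (feature shapes, mean-pooling, cosine similarity, proposal generation) is an illustrative assumption, not the architecture of any cited model.

```python
import torch
import torch.nn.functional as F

def ground_query(clip_feats, query_feat, proposals):
    """Score each candidate (start, end) segment by cosine similarity between the
    mean-pooled clip features inside it and the sentence embedding; return the best."""
    scores = []
    for start, end in proposals:
        segment = clip_feats[start:end].mean(dim=0)        # pool per-clip features over the segment
        scores.append(F.cosine_similarity(segment, query_feat, dim=0))
    best = int(torch.stack(scores).argmax())
    return proposals[best]

# Toy usage with random features (T clips, D-dim features).
T, D = 32, 256
clip_feats = torch.randn(T, D)
query_feat = torch.randn(D)
proposals = [(s, s + l) for s in range(0, T, 4) for l in (4, 8) if s + l <= T]
print(ground_query(clip_feats, query_feat, proposals))
```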
“…The network, however, may easily locate the target segments due to the obvious boundary artifacts in the composed videos. To this end, we also add the irrelevant query as a regularization, as used in (Arbelle et al. 2021), in which the network should refuse to respond when given an irrelevant query as input.…”
Section: Pseudo-Labels Generation
Citation type: mentioning
Confidence: 99%
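
The "refuse to respond" regularization referenced above can be sketched as an extra loss term that drives the grounding scores toward zero whenever the query is sampled from a different video. The binary-cross-entropy form, tensor shapes, and weighting below are assumptions made for illustration, not the exact loss of Arbelle et al. 2021 or of the citing paper.

```python
import torch
import torch.nn.functional as F

def irrelevant_query_loss(logits_irrelevant):
    """Push all per-clip scores toward zero when the query does not match the video,
    so the model learns to 'refuse to respond' to irrelevant queries."""
    target = torch.zeros_like(logits_irrelevant)
    return F.binary_cross_entropy_with_logits(logits_irrelevant, target)

# Toy usage: logits produced for a video paired with a query sampled from another video.
logits = torch.randn(1, 32)            # (batch, num_clips) relevance logits
reg = irrelevant_query_loss(logits)
total_loss = 0.0 + 1.0 * reg           # grounding loss (omitted here) + lambda * regularizer
print(float(reg))
```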
“…Following the encoder-decoder architecture, a slightly different approach consists of learning to ground entity-region pairs by randomly blending arbitrary image pairs, which are then reconstructed conditioned on the corresponding texts [3]. Leveraging the idea of a similarity measure between the two modalities, other works developed a contrastive learning framework in which the model localizes entity-region pairs from image-sentence supervision: the contrastive examples may be produced by replacing words in sentences [9], or by distilling knowledge in order to compute accurate similarity scores [24].…”
Section: Related Work
Citation type: mentioning
Confidence: 99%
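
The blend-and-reconstruct idea attributed to [3] in the quote above can be illustrated with a minimal sketch: compose two images with a known soft alpha map, condition a network on one image's caption, and supervise the predicted per-pixel map with the known blend. The tiny GroundingNet, the random alpha map, and the placeholder caption embedding below are assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingNet(nn.Module):
    """Toy text-conditioned network that outputs a per-pixel relevance map."""
    def __init__(self, text_dim=256):
        super().__init__()
        self.visual = nn.Conv2d(3, 16, 3, padding=1)
        self.text_proj = nn.Linear(text_dim, 16)

    def forward(self, image, text_emb):
        v = self.visual(image)                              # (B, 16, H, W) visual features
        t = self.text_proj(text_emb)[:, :, None, None]      # (B, 16, 1, 1) text conditioning
        return (v * t).sum(dim=1, keepdim=True)             # (B, 1, H, W) per-pixel logits

net = GroundingNet()
img_a, img_b = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
text_a = torch.randn(1, 256)                                # caption embedding of image A (placeholder)
alpha = torch.rand(1, 1, 64, 64)                            # known soft mixing map used to compose the pair
blended = alpha * img_a + (1 - alpha) * img_b

pred = net(blended, text_a)                                 # predict where image A's content is
loss = F.binary_cross_entropy_with_logits(pred, alpha)      # supervise with the known blend map
loss.backward()
```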
“…To assess the soundness of our approach we tested a variant of our model that replaces the visual and textual branches, responsible for learning the multimodal embedding space, with CLIP's multimodal embeddings (referred to as SPR baseline + CLIP) [18]. […] (Table 1), our full SPR model still outperforms the variant with CLIP. This occurs because CLIP was trained to capture coarse-grained multimodal information from image and sentence pairs, while in VG we need more fine-grained details about region-query alignments.…”
Results interleaved in the excerpt from the citing paper's Table 1 (method, followed by four metric columns; column headers not recoverable from the snippet):
[4]: 37.7 / - / 15.8 / -
Semantic Self-Supervision [11]: - / 49.1 / - / 40.0
Anchored Transformer [28]: 33.1 / - / 13.6 / -
Multi-level Multimodal [1]: - / 57.9 / - / 48.4
Align2Ground [5]: 11.5 / 71.0 / - / -
Counterfactual Resilience [8]: 48.66 / - / - / -
Multimodal Alignment Framework (MAF) [25]: 61.4 / - / - / -
Contrastive Learning [9]: - / 74.9 / - / -
Grounding By Separation [3]: - / 70.5 / - / 59.4
Relation-aware [15]: 59 (truncated)
Section: Full Training Set Scheme
Citation type: mentioning
Confidence: 99%
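
The "SPR baseline + CLIP" variant described above swaps the learned visual and textual branches for frozen CLIP embeddings. A minimal sketch of extracting such embeddings with the Hugging Face transformers CLIP wrappers follows; the checkpoint name and the simple cosine-similarity check are assumptions, and the snippet only produces the coarse image/sentence-level embeddings that the quote argues are insufficient for region-level grounding.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))                 # placeholder image
texts = ["a dog catching a frisbee"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Cosine similarity between the coarse, image/sentence-level CLIP embeddings;
# region-level alignment would still require a grounding model on top.
sim = torch.nn.functional.cosine_similarity(img_emb, txt_emb)
print(float(sim))
```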