2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01248
Multi-stage Aggregated Transformer Network for Temporal Language Localization in Videos

Cited by 63 publications (49 citation statements). References 31 publications.
“…In [117], [142], SA is entirely replaced by CA, but it can also be kept [54]. Indeed, [53] has their modalities co-attending to each other and self-attending to themselves, claiming this keeps intra-modal and inter-modal dynamics separate, up to some degree. In contrast to encoder fusion, the computational cost is reduced to O((…”
Section: Cross-Attention Fusion (CAF). Citation type: mentioning.
Confidence: 99%
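To make the co-attention arrangement described in [53] concrete, here is a minimal PyTorch sketch: each modality self-attends to itself (intra-modal dynamics) and cross-attends to the other modality (inter-modal dynamics). All module names, dimensions, and the residual/normalization choices are illustrative assumptions, not the cited papers' code.

# Minimal sketch of co-attention fusion: self-attention keeps intra-modal
# dynamics, cross-attention carries inter-modal dynamics (per the quote above).
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, a, b):
        # Intra-modal: each modality attends to itself.
        a = a + self.self_attn_a(a, a, a)[0]
        b = b + self.self_attn_b(b, b, b)[0]
        # Inter-modal: queries from one modality, keys/values from the other.
        a = self.norm_a(a + self.cross_attn_a(a, b, b)[0])
        b = self.norm_b(b + self.cross_attn_b(b, a, a)[0])
        return a, b

video = torch.randn(2, 32, 256)  # (batch, video tokens, dim)
text = torch.randn(2, 12, 256)   # (batch, word tokens, dim)
video, text = CoAttentionBlock()(video, text)

Unlike encoder fusion, which self-attends over the concatenated sequence, this layout never mixes the two token sets inside a single attention call, which is the separation of dynamics the quoted passage refers to.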
“…Other approaches involve discretizing video tokens using quantization [50] or classifying the contents present in them [13]. Also, in [53] authors propose an asymmetrical use of MTP where only textual tokens are masked, but visual context is used to reconstruct them. The most common approach, however, is found in other works that propose adapting a variant of the Noise Contrastive Estimation (NCE) [178], which we detail when discussing contrastive losses.…”
Section: Self-Supervised Tasks. Citation type: mentioning.
Confidence: 99%
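The asymmetric masked-token objective attributed to [53] can be illustrated with a short sketch: only text tokens are masked, while the unmasked visual tokens sit in the same sequence so the encoder can exploit them to reconstruct the words. Every name, size, and the 15% masking rate below are assumptions for illustration, not the cited paper's released code.

# Sketch of asymmetric masked token prediction: mask text only,
# reconstruct it from the joint (visual + text) context.
import torch
import torch.nn as nn

dim, vocab = 256, 10000
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
word_head = nn.Linear(dim, vocab)          # predicts the original word id

visual = torch.randn(2, 32, dim)           # visual tokens: never masked
text = torch.randn(2, 12, dim)             # text token embeddings
labels = torch.randint(0, vocab, (2, 12))  # original word ids

# Mask ~15% of the *text* tokens only (learned [MASK] vector, assumed).
mask_vec = nn.Parameter(torch.zeros(dim))
mask = torch.rand(2, 12) < 0.15
text = torch.where(mask.unsqueeze(-1), mask_vec.expand_as(text), text)

# One sequence: visual context followed by the partially masked text.
hidden = encoder(torch.cat([visual, text], dim=1))
logits = word_head(hidden[:, visual.size(1):])  # text positions only

# Reconstruction loss only on the masked positions.
loss = nn.functional.cross_entropy(logits[mask], labels[mask])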
“…SV-VMR [94] decomposes query into multiple semantic roles [95] and performs multi-level cross-modal reasoning at semantic level. MATN [52] further concatenates proposals and query words into a sequence, and encodes them through a single-stream transformer network. It also devises a novel multi-stage boundary regression to refine the predicted moments.…”
Section: Temporal Adjacent Network. Citation type: mentioning.
Confidence: 99%
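As a rough illustration of the single-stream design attributed to MATN [52], the sketch below concatenates proposal and query-word tokens, encodes them with one transformer, and lets each proposal token regress a boundary correction over two stages. The module names, sizes, and the specific refinement update are hypothetical; the paper's actual multi-stage boundary regression likely differs in detail.

# Sketch of a single-stream encoder over [proposals; words] with
# multi-stage boundary refinement (assumed two stages for illustration).
import torch
import torch.nn as nn

dim = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
boundary_head = nn.Linear(dim, 2)    # per-proposal (start, end) offsets

proposals = torch.randn(2, 16, dim)  # candidate moment features
words = torch.randn(2, 12, dim)      # query word features
boundaries = torch.rand(2, 16, 2)    # initial (start, end) in [0, 1]

# Each stage jointly re-encodes both token types, then nudges the boundaries.
for stage in range(2):
    hidden = encoder(torch.cat([proposals, words], dim=1))
    offsets = boundary_head(hidden[:, :proposals.size(1)])
    boundaries = boundaries + 0.1 * torch.tanh(offsets)  # small correction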