2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00081

Intra- and Inter-Action Understanding via Temporal Action Parsing

Cited by 55 publications (24 citation statements). References 36 publications.
“…As a result, most temporal localization methods [49-51, 66, 67, 84] contain a temporal proposal module to simply treat video segments that do not belong to pre-defined classes as the background. Temporal segmentation methods [18,43,60] typically divide a video into segments of actions, or sub-actions [62,63]. But still, those methods can only predict boundaries of pre-defined classes, not generic boundaries.…”
Section: Related Work (mentioning)
confidence: 99%
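
To make the excerpt's point concrete, here is a minimal, hypothetical sketch (not from the paper or any citing work) of how frame-wise predictions over pre-defined classes yield segment boundaries. Boundaries appear only where the predicted class changes, so any transition hidden inside a "background" run is invisible, which is why such methods cannot produce generic boundaries. The label values and helper are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-frame class predictions from a temporal segmentation
# model; class 0 plays the role of the "background" label that proposal-based
# methods assign to segments outside the pre-defined classes.
frame_labels = np.array([0, 0, 2, 2, 2, 2, 0, 5, 5, 5, 0, 0])

def labels_to_segments(labels):
    """Group consecutive frames with the same label into (start, end, class) segments."""
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, int(labels[start])))
            start = t
    return segments

# Boundaries exist only where a pre-defined class starts or ends; a change
# between two unseen actions inside a background run produces no boundary.
print(labels_to_segments(frame_labels))
# [(0, 2, 0), (2, 6, 2), (6, 7, 0), (7, 10, 5), (10, 12, 0)]
```
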
“…To use the high-level segment information to refine the erroneous predictions in low-level frames, we capture the relations between the frames F_0 and their corresponding segments F_1, F_2, and F_3, respectively. Recently, the Transformer (Vaswani et al. 2017), a kind of attention layer, has shown promising results in learning attentive weights/relationships, with applications to images, words, and videos (Vaswani et al. 2017; Shao et al. 2020; Dosovitskiy et al. 2020; Arnab et al. 2021). Here, we adopt the transformer to learn the relationships between segments and frames.…”
Section: Segment-Frame Attention (SFA) Module (mentioning)
confidence: 99%
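
Since this excerpt describes attention between frame-level and segment-level features, a minimal sketch of one plausible realization may help. This is an illustrative guess at the mechanism, not the cited SFA module's actual implementation; the dimensions, head count, and residual connection are all assumptions.

```python
import torch
import torch.nn as nn

class SegmentFrameAttention(nn.Module):
    """Sketch of segment-to-frame cross-attention: per-frame features act as
    queries over pooled segment features, so coarse segment context can
    refine noisy frame-level predictions. Hypothetical configuration."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, segment_feats):
        # frame_feats:   (B, T, dim) -- per-frame features (queries)
        # segment_feats: (B, S, dim) -- pooled features of S segments (keys/values)
        refined, _ = self.attn(frame_feats, segment_feats, segment_feats)
        return self.norm(frame_feats + refined)  # residual connection

# Usage: 100 frames attending to 8 segment-level features.
frames = torch.randn(2, 100, 256)
segments = torch.randn(2, 8, 256)
out = SegmentFrameAttention(dim=256)(frames, segments)
print(out.shape)  # torch.Size([2, 100, 256])
```
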
“…Traditional fully-supervised deep learning methods typically require large amounts of annotated data, introducing a significant, prone-to-ambiguity annotation workload [36,37,47,55]. For this reason, learning with scarce data (i.e., few-shot learning) has received increasing attention in domains like object detection [8,14,28,42-44], action recognition [1,2,4,13,54,59], and action localization [9,15,51,52].…”
Section: Related Work (mentioning)