2022
DOI: 10.1109/tmm.2021.3073235
Action Coherence Network for Weakly-Supervised Temporal Action Localization

Cited by 23 publications (9 citation statements)
References 71 publications
“…Multiple methods use features extracted from a pre-trained two-stream model, I3D [32], as input to their weakly supervised model [23][24][25]. In addition to MIL, supervision on feature similarity or difference between clips in videos [23,25,33] and adversarial erasing of clip predictions [34,35] are also used to encourage localization predictions that are temporally complete. In early stages of our study, we mainly attempted a MIL approach, but the noisy data, low number of samples and similar appearance of videos from the two classes were prohibitive for such a model to learn informative patterns.…”
Section: Weakly Supervised Action Recognition and Localization
confidence: 99%
“…Multiple methods use features extracted from a pre-trained two-stream model, I3D [8], as input to their weakly supervised model [20,29,33]. In addition to MIL, supervision on feature similarity or difference between clips in videos [20,29,48], and adversarial erasing of clip predictions [36,47] are also used to encourage localization predictions that are temporally complete.…”
Section: Related Work
confidence: 99%
“…Set-supervised Learning. The set of actions present in training videos is assumed known in [9,21,22,23,25,32,34,35,36,40,41,45,43,44,7]. For example, Shou et al [32] specified the outer-inner-contrastive loss for learning an action boundary detector, Nguyen et al [23] defined a background-aware loss to distinguish actions from the background, and Paul et al [25] proposed an action affinity loss for multi-instance learning.…”
Section: Related Work
confidence: 99%
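The citation statements above revolve around multiple-instance learning (MIL) over per-clip features with only video-level labels: each clip gets class scores, and these are pooled into a video-level prediction that the weak label supervises. As a minimal illustrative sketch (not the exact method of any cited paper), top-k mean pooling is one common aggregation choice; the function names and toy data here are assumptions for demonstration only:

```python
import numpy as np

def mil_video_scores(clip_logits, k=3):
    """Pool per-clip class logits into video-level scores by
    top-k mean pooling, a common MIL aggregation choice."""
    T, C = clip_logits.shape
    k = min(k, T)
    # For each class, average its k highest clip scores across time.
    topk = np.sort(clip_logits, axis=0)[-k:]   # shape (k, C)
    return topk.mean(axis=0)                   # shape (C,)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy example: 8 clips, 2 classes; the "action" class spikes in a few clips,
# mimicking a sparse action inside an untrimmed video.
rng = np.random.default_rng(0)
logits = rng.normal(0.0, 0.1, size=(8, 2))
logits[2:5, 1] += 2.0          # clips 2-4 look like the action class
video_probs = softmax(mil_video_scores(logits, k=3))
print(video_probs.argmax())    # the action class should dominate
```

Training would then apply a classification loss to `video_probs` against the video-level label, so gradients flow only through the most confident clips, which is why the quoted works add similarity and erasing losses to recover temporally complete localizations.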