2018
DOI: 10.1007/978-3-030-01270-0_10
AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos

Abstract: Temporal Action Localization (TAL) in untrimmed video is important for many applications, but it is very expensive to annotate segment-level ground truth (action class and temporal boundary). This motivates addressing TAL with weak supervision, where only video-level annotations are available during training. However, state-of-the-art weakly-supervised TAL methods focus only on generating a good Class Activation Sequence (CAS) over time and then conduct simple thresholding on the CAS to localize ac…
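For context, the simple CAS thresholding that the abstract criticizes can be sketched as follows. This is a minimal single-class illustration, not the paper's method; the function name and threshold value are assumptions:

```python
import numpy as np

def threshold_localize(cas, thresh=0.5):
    """Baseline localization: keep snippets whose class activation exceeds
    a fixed threshold and merge consecutive ones into segments."""
    above = cas > thresh
    segments, start = [], None
    for t, flag in enumerate(above):
        if flag and start is None:
            start = t                      # segment opens
        elif not flag and start is not None:
            segments.append((start, t))    # segment closes at t (exclusive)
            start = None
    if start is not None:                  # segment runs to the end
        segments.append((start, len(cas)))
    return segments

# Example CAS for one class over 6 snippets
cas = np.array([0.1, 0.8, 0.9, 0.2, 0.7, 0.1])
print(threshold_localize(cas))  # → [(1, 3), (4, 5)]
```

The weakness AutoLoc targets is visible here: the segments depend entirely on one hand-picked threshold, with no learned notion of where an action's boundary actually lies.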


Cited by 257 publications (161 citation statements). References 77 publications.
“…There are an average of 1.5 activity instances per video. As in [22,16], we use the training set to train and validation set to test our approach. Count Labels: The ground-truth count labels for the videos in both datasets are generated using the available temporal action segments information.…”
Section: Methods (mentioning; confidence: 99%)
“…We report mAP scores at different IoU thresholds. Both UntrimmedNets [28] and Autoloc [22] use TSN [29] as the backbone, whereas STPN [14] and W-TALC [16] use I3D networks similar to our framework. The STPN approach obtains an mAP of 16.9 at IoU=0.5, while W-TALC achieves an mAP of 22.0.…”
Section: State-of-the-art Comparison (mentioning; confidence: 99%)
“…TSR-Net [30] integrates self-attention and transfer learning with a temporal localization framework to obtain precise temporal intervals in untrimmed videos. AutoLoc [31] directly predicts the temporal boundary of each action instance, using an outer-inner-contrastive loss to train the boundary predictor. W-TALC [32] learns the network weights by optimizing two complementary loss functions, namely a co-activity similarity loss and a multiple instance learning loss.…”
Section: B. Action Localization (mentioning; confidence: 99%)
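The outer-inner-contrastive loss mentioned in the citation above can be sketched roughly as follows. This is a simplified single-class illustration under assumed names; the inflation ratio and function signature are assumptions, not AutoLoc's exact implementation:

```python
import numpy as np

def oic_loss(cas, start, end, inflation=0.25):
    """Outer-Inner-Contrastive loss for one candidate segment [start, end):
    mean activation in the surrounding 'outer' margins minus the mean
    activation inside the segment. Lower is better: a good boundary has
    high activation inside and low activation just outside."""
    length = end - start
    margin = max(1, int(round(inflation * length)))  # inflate on each side
    lo = max(0, start - margin)
    hi = min(len(cas), end + margin)
    inner = cas[start:end].mean()
    outer_vals = np.concatenate([cas[lo:start], cas[end:hi]])
    outer = outer_vals.mean() if outer_vals.size else 0.0
    return float(outer - inner)

cas = np.array([0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1])
# A boundary aligned with the activation bump scores lower (better)
# than a misaligned one.
print(oic_loss(cas, 2, 5))  # ≈ -0.8
print(oic_loss(cas, 0, 3))
```

Minimizing this quantity pushes the predicted boundary to enclose high activation and exclude its surroundings, which is what lets the boundary predictor be trained without segment-level labels.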
“…Weakly Supervised Localization has been studied extensively to use weak supervision for object detection in images and action localization in videos (Oquab et al., 2015; Bilen and Vedaldi, 2016; Kantorov et al., 2016; Jie et al., 2017; Diba et al., 2017; Papadopoulos et al., 2017; Duchenne et al., 2009; Laptev et al., 2008; Bojanowski et al., 2014; Shou et al., 2018a). Some methods use class labels to train object detectors.…”
Section: arXiv:1909.00239v1 [cs.CV] 31 Aug 2019 (mentioning; confidence: 99%)
“…Gao et al. utilized object counts for weakly supervised object detection. Instead of using temporally labeled segments, weakly supervised action detectors use weaker annotations, e.g., movie scripts (Duchenne et al., 2009; Laptev et al., 2008), the order of the occurring action classes in videos (Bojanowski et al., 2014), and video-level class labels (Shou et al., 2018a).…”
Section: arXiv:1909.00239v1 [cs.CV] 31 Aug 2019 (mentioning; confidence: 99%)