“…Set-supervised Learning. The set of actions present in training videos is assumed known in [9,21,22,23,25,32,34,35,36,40,41,45,43,44,7]. For example, Shou et al. [32] specified the outer-inner-contrastive loss for learning an action boundary detector, Nguyen et al. [23] defined a background-aware loss to distinguish actions from the background, and Paul et al. [25] proposed an action affinity loss for multi-instance learning.…”