“…Traditional fully-supervised deep learning methods typically require large amounts of annotated data, introducing a significant proneto-ambiguity annotation workload [36,37,47,55]. For this reason, learning with scarce data (i.e., few-shot learning) has received increasing attention, in domains like object detection [8,14,28,[42][43][44], action recognition [1,2,4,13,54,59], and action localization [9,15,51,52]. Current works in this domain either learn using trimmed [1,2,4,22,58,60] or well-annotated untrimmed videos [51], or address class-agnostic localization tasks [9,15,52] -learning with both scarce data and limited annotation for both action recognition and localization is still an under-explored area.…”