“…UCF101 [36], HMDB [22] and Kinetics [19] have been widely used for recognizing actions in video clips [40,29,45,8,35,44,7,28,26,38,18,41]; THUMOS [17], ActivityNet [4] and AVA [13] were introduced for temporal and spatio-temporal action localization [33,48,27,37,52,53,3,5,24]. Recently, significant attention has been drawn to modeling human-human [13] and human-object interactions in daily actions [31,34,42]. In contrast to these datasets, which were designed to evaluate motion and appearance modeling or human-object interactions, our Agent-in-Place Action (APA) dataset is the first to focus on actions defined with respect to scene layouts, including interactions with places and moving directions.…”