“…While human perception typically involves inferring the physical attributes of humans (detection [5,35,43,50], poses [3,4,8,25,28,41], shape [13,20,29,30], gaze [44], etc.), interpreting humans involves reasoning about the finer details relating to human activity [6,24,27,48,49], behaviour [26,34], human-object visual relationship detection [23,33,36,37,39,40], and human-object interactions [23,32,33,36,37,39,40,42]. In this work, we investigate the problem of identifying Human-Object Interactions in videos.…”
(arXiv:2012.09402v1 [cs.CV], 17 Dec 2020)