“…Several works tackle this problem from a graph-based perspective [40,63,100,101], such as applying Graph Convolutional Networks (GCNs) [49,96]. More recent works utilize attention modeling [63,73,98,103], including using Transformers [26,57], with a focus on determining the most critical persons [26,72,96,103], groups [24,57], or interactions [101]. Existing works have primarily used RGB- and optical-flow-based features with RoIAlign [33] to represent individuals [6,73,96,100].…”