“…SRL in Vision has been explored in the context of human-object interaction (Gupta and Malik, 2015), situation recognition (Yatskar et al., 2016), and multimedia extraction (Li et al., 2020). Most related to ours is the use of SRLs for grounding (Silberer and Pinkal, 2018) in images and videos (Sadhu et al., 2020). Our work builds on Sadhu et al. (2020) in using SRLs on video descriptions; however, our focus is not on grounding.…”