“…Integrating contextual information (e.g., about the type of scene, or the presence of other objects) can increase speed and robustness, but "when and how" to do this (before, during, or after the detection) is still an open problem. Some proposed solutions include the use of (i) spatio-temporal context [e.g., Palma-Amestoy et al (2010)], (ii) spatial structure among visual words [e.g., Wu et al (2009)], and (iii) semantic information aiming to map semantically related features to visual words [e.g., Wu et al (2010)], among many others [e.g., Torralba and Sinha (2001), Divvala et al (2009), Sun et al (2012), Mottaghi et al (2014), and Cadena et al (2015)]. While most methods consider the detection of objects in a single frame, temporal features can be beneficial [e.g., Viola et al (2005) and Dalal et al (2006)].…”