“…In response, recent work has focused on using crowdsourcing and automatic filtering to design large-scale benchmarks while maintaining negative examples that are adversarial to machines (Zellers et al., 2018). We will review recent benchmarks that have emerged to assess whether machines have acquired physical (e.g., Talmor et al., 2019; Zellers et al., 2019), social (e.g., Sap et al., 2019b), or temporal commonsense reasoning capabilities, as well as benchmarks that combine commonsense abilities with other tasks (e.g., reading comprehension; Ostermann et al., 2018).…”