Commonsense Reasoning Benchmarks Many benchmarks measuring the commonsense reasoning abilities of state-of-the-art models have been released in recent years. Starting with the well-known Winograd Schema Challenge (WSC; Levesque et al., 2011), these benchmarks have attempted to test the commonsense reasoning ability of models using different task formats, such as pronoun resolution (Levesque et al., 2011; Rudinger et al., 2018; Eisenschlos et al., 2023), question answering (Talmor et al., 2019; Zellers et al., 2018; Chen et al., 2019; Reddy et al., 2019), plausible inference (Roemmele et al., 2011; Wang et al., 2019b; Singh et al., 2021; Gao et al., 2022), and natural language generation (Lin et al., 2020b). Benchmarks have also been created to evaluate commonsense reasoning across different dimensions of commonsense knowledge, including social (Rashkin et al., 2018a,b; Sap et al., 2019b), physical (Dalvi et al., 2018; Storks et al., 2021), temporal (Qin et al., 2021; Zhou et al., 2019), and numerical reasoning (Lin et al., 2020a).