“…These studies have found that datasets may contain many spurious artefacts, and that PLMs achieve inflated performance by exploiting these artefacts (Habernal et al., 2018; Niven and Kao, 2019; McCoy et al., 2019; Bender and Koller, 2020). Another line of work observed that many PLMs that show promising results on GLUE fall short of expectations on more difficult tasks requiring linguistic knowledge (Bhatt et al., 2021) or logical reasoning (Tian et al., 2021). As a result, the importance of well-designed, more challenging evaluation datasets has been highlighted, and new datasets such as CHECKLIST (Ribeiro et al., 2020) and LOGICNLI (Tian et al., 2021) have been proposed.…”