“…The GLUE and SuperGLUE datasets include diagnostic sets where annotators manually labeled samples of examples as requiring a broad range of linguistic phenomena. The types of phenomena manually labeled include lexical semantics, predicate-argument structure, logic, and common sense or world knowledge.…”

[Phenomena and citations interleaved in the source text: Proto-Roles (White et al, 2017), Paraphrastic Inference (White et al, 2017), Event Factuality (Poliak et al, 2018b; Staliūnaitė, 2018), Anaphora Resolution (White et al, 2017; Poliak et al, 2018b), Lexicosyntactic Inference (Pavlick and Callison-Burch, 2016; Poliak et al, 2018b; Glockner et al, 2018), Compositionality (Dasgupta et al, 2018), Prepositions (Kim et al, 2019), Comparatives (Kim et al, 2019; Richardson et al, 2020), Quantification/Numerical Reasoning (Naik et al, 2018; Kim et al, 2019; Richardson et al, 2020), Spatial Expressions (Kim et al, 2019), Negation (Naik et al, 2018; Kim et al, 2019; Richardson et al, 2020), Tense & Aspect (Kober et al, 2019), Veridicality (Poliak et al, 2018b), Monotonicity (Yanaka et al, 2019, 2020; Richardson et al, 2020), Presupposition (Jeretic et al, 2020), Implicatures (Jeretic et al, 2020), Temporal Reasoning (Vashishtha et al, 2020)]