“…Challenge Sets. The idea of creating challenging contrastive evaluation sets has a long history (Levesque et al., 2011; Ettinger et al., 2017; Glockner et al., 2018; Naik et al., 2018; Isabelle et al., 2017). Challenge sets exist for various phenomena, including ones with "minimal" edits similar to our contrast sets, e.g., in image captioning (Shekhar et al., 2017), machine translation (Sennrich, 2017; Burlot and Yvon, 2017; Burlot et al., 2018), and language modeling (Marvin and Linzen, 2018; Warstadt et al., 2019). Minimal pairs of edits that perturb gender or racial attributes are also useful for evaluating social biases (Zhao et al., 2018; Lu et al., 2018).…”