“…Our dataset contains human judgements on the performance of nine MT systems on the translation of the 250 pronouns in the PROTEST test suite. The systems include five submissions to the DiscoMT 2015 shared task on pronoun translation (Hardmeier et al., 2015): four phrase-based SMT systems, AUTO-POSTEDIT (Guillou, 2015), UU-HARDMEIER (Hardmeier et al., 2015), IDIAP (Luong et al., 2015), and UU-TIEDEMANN (Tiedemann, 2015), and one rule-based system, ITS2 (Loáiciga and Wehrli, 2015). The sixth system is the shared task baseline (also phrase-based SMT). Three NMT systems are included for comparison: LIMSI (Bawden et al., 2017), NYU (Jean et al., 2014), and YANDEX (Voita et al., 2018).…”