“…(Harman, 1993), SQuAD (Rajpurkar et al, 2018), NewsQA (Trischler et al, 2017), SearchQA (Dunn et al, 2017), and QuAC (Choi et al, 2018), and intensive efforts were made to build new models that surpass human performance on these datasets, including pre-trained language models (Devlin et al, 2019; Yang et al, 2019a) and ensemble models that outperform humans, in particular on SQuAD (Lan et al, 2020; Yamada et al, 2020). More challenging datasets have also been introduced, which require several reasoning steps to answer (Yang et al, 2018; Qi et al, 2021), understanding of a much larger context (Kočiský et al, 2018), or understanding of adversarial content and numeric reasoning (Dua et al, 2019).…”