“…We compare single models and ensemble models. For a fair comparison, we only compare with results obtained without external contextualized embeddings. In the ensemble experiment, we train 8 models with different random seeds and ensemble the results by a voting strategy.…”

…      (Chen et al., 2017)     4.3M    88.0
DIIN   (Gong et al., 2018)     4.4M    88.0
MwAN   (Tan et al., 2018)      14M     88.3
CAFE   (Tay et al., 2018b)     4.7M    88.5
HIM    (Chen et al., 2017)     7.7M    88.6
SAN    (Liu et al., 2018)      3.5M    88.6
CSRAN  (Tay et al., 2018a)     13.9M   88.7
DRCN   (Kim et al., 2018)      6.7M    88.9
RE2    (…

Model                            Acc(%)
ESIM (Chen et al., 2017)         70.6
DecompAtt (Parikh et al., 2016)  72.3
DGEM (Khot et al., 2018)         77.3
HCRN (Tay et al., 2018c)         80.0
CAFE (Tay et al., 2018b)         83.3
CSRAN (Tay et al., 2018a)        86.7
RE2 (ours)                       86.0±0.6
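
The ensemble step mentioned in the excerpt (several models trained with different random seeds, combined by voting) could be sketched roughly as follows. This is only an illustration under stated assumptions: the excerpt does not say which voting scheme is used, so the sketch assumes a plain majority vote over hard label predictions, with ties broken in favor of the smallest label. The function name `majority_vote` and the toy predictions are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by majority vote.

    predictions[m][i] is the label that model m assigns to example i.
    Ties go to the smallest label (an assumption, not stated in the source).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must predict the same number of examples")

    voted = []
    for i in range(n_examples):
        # Count how many models chose each label for example i.
        counts = Counter(model_preds[i] for model_preds in predictions)
        best = max(counts.values())
        # Deterministic tie-break: smallest label among the most-voted ones.
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted


# Hypothetical predictions from 8 models (one per random seed) on 3 examples.
toy_predictions = [
    [0, 1, 2],
    [0, 1, 1],
    [0, 2, 2],
    [1, 1, 2],
    [0, 1, 0],
    [0, 1, 2],
    [2, 1, 2],
    [0, 0, 2],
]
ensemble_labels = majority_vote(toy_predictions)
```

Voting over hard labels needs only each model's predicted class, which is why it is used here; averaging predicted probabilities would be an alternative if the models expose them.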