1 We select the MLE models with the lowest negative log-likelihood and the MLE+RL models with the highest ROUGE-L scores on a sample of validation data to evaluate on the test set. At test time, we use beam search of width 5 on all our models to generate final predictions.

Model                                                           ROUGE-1  ROUGE-2  ROUGE-L
SummaRuNNer (Nallapati et al., 2017)                              39.60    16.20    35.30
graph-based attention (Tan et al., 2017)                          38.01    13.90    34.00
pointer-generator (See et al., 2017)                              36.44    15.66    33.42
pointer-generator + coverage (See et al., 2017)                   39.53    17.28    36.38
controlled summarization with fixed values (Fan et al., 2017)     39.75    17.29    36.54
RL, with intra-attention (Paulus et al., 2018)                    41.16    15.75    39.08
ML+RL, with intra-attention (Paulus et al., 2018)                 39

Model                                                           ROUGE-1  ROUGE-2  ROUGE-L
ML, no intra-attention (Paulus et al., 2018)                      44.26    27.43    40.41
RL, no intra-attention (Paulus et al., 2018)                      47.22    30.51    43.27
ML+RL, no intra-attention (Paulus et al., 2018)                   47
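As a rough illustration of the decoding procedure described above (beam search of width 5 keeping the highest-scoring partial sequences at each step), here is a minimal generic sketch. It is not the paper's implementation: `step_fn` is a hypothetical stand-in for the model's next-token distribution, and token names are illustrative.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=5, max_len=20):
    """Generic beam-search decoder.

    step_fn(prefix) -> dict mapping each candidate next token to its
    probability given the prefix. At every step we keep only the
    `beam_width` partial sequences with the highest cumulative
    log-probability; a hypothesis is finished once it emits `end_token`.
    """
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, prob in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(prob)))
        # Prune to the top `beam_width` hypotheses by score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:  # every surviving hypothesis has ended
            break
    finished.extend(beams)  # fall back to unfinished beams at max_len
    return max(finished, key=lambda c: c[1])[0]
```

With a width of 1 this reduces to greedy decoding; wider beams trade decoding time for a better search over output summaries.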