“…Zhang et al (2018a), and […] tackle response generation, taking previous utterances as input and producing the next utterance as output ([…] also specifically include the responding speaker and target addressee in the inputs and outputs). Zhang et al (2018a) report BLEU-n (computed over n-grams, n = 1, 2, 3, 4) and METEOR (Banerjee and Lavie, 2005) scores (noting that the evaluation could be supplemented); […] report BLEU, ROUGE (Lin, 2004), noun mentions, and the length of the generated response, along with limited human evaluations for fluency, consistency, and informativeness; and […] report BLEU-n (n = 1, 2, 3, 4), METEOR, and ROUGE-L (L for longest common subsequence), along with human evaluations for fluency, grammaticality, and rationality. Qiu et al (2020) focus on the dialogue thread structures, which are utilized in […], using structured attention with a Variational RNN and reporting the same automatic metrics: BLEU-n (n = 1, 2, 3, 4), METEOR, and ROUGE-L.…”
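
As a rough illustration of two of the automatic metrics named in the passage, the sketch below computes a clipped n-gram precision (the core quantity behind BLEU-n, here without the brevity penalty or the geometric mean over n) and an LCS-based ROUGE-L F-score. The function names, the whitespace tokenization, and the equal weighting of precision and recall are assumptions made for illustration; this is not the evaluation code used by any of the cited papers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference (the "clipping" used in BLEU).
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall
    (assumption: equal weighting, i.e. beta = 1)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical generated response and gold next utterance.
    generated = "the cat sat on the mat".split()
    gold = "the cat is on the mat".split()
    for n in (1, 2, 3, 4):
        print(f"clipped {n}-gram precision:", clipped_ngram_precision(generated, gold, n))
    print("ROUGE-L F1:", rouge_l_f1(generated, gold))
```

Published scores are normally produced with established toolkits and may differ from this sketch in tokenization, smoothing, and multi-reference handling.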