“…Baseline. We compare our evaluation metrics with eleven popular automatic dialogue evaluation metrics: three lexical word-overlap metrics, BLEU, ROUGE, and METEOR (Banerjee and Lavie 2005); five metrics that consider semantic representations, BERTScore, ADEM (Lowe et al. 2017), BERT-RUBER, BLEURT, and QuantiDCE (Ye et al. 2021); two metrics that take additional dialogue information into account, DynaEval and GRADE; and ChatGPT.

Evaluation. The common practice for demonstrating the effectiveness of a dialogue evaluation metric is to compute the correlation between the model-predicted scores and the human-rated scores (Zhang et al. 2021; Huang et al. 2020).…”
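The meta-evaluation step described above is typically a straightforward correlation computation. The following is a minimal sketch, not taken from the paper, of how metric-predicted scores might be compared against human ratings using Pearson and Spearman correlations; the score arrays and the helper function `correlate` are hypothetical placeholders for illustration only.

```python
# Minimal sketch of the standard meta-evaluation step: correlating
# metric-predicted scores with human ratings for the same responses.
# The data below are hypothetical, not results from the paper.
from scipy.stats import pearsonr, spearmanr


def correlate(metric_scores, human_scores):
    """Return (Pearson r, Spearman rho) between metric and human scores."""
    pearson_r, _ = pearsonr(metric_scores, human_scores)
    spearman_rho, _ = spearmanr(metric_scores, human_scores)
    return pearson_r, spearman_rho


# Hypothetical scores for five dialogue responses.
metric_scores = [0.62, 0.41, 0.88, 0.30, 0.75]  # predicted by an automatic metric
human_scores = [3.5, 2.0, 4.5, 1.5, 4.0]        # averaged human ratings

r, rho = correlate(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson correlation captures linear agreement with the human scores, while Spearman captures agreement in ranking; papers in this area commonly report both.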