“…Unfortunately, methods such as BLEU (Papineni et al., 2002) have been shown not to be applicable to conversational dialogue systems (Liu et al., 2016). Following this observation, a trend towards training methods for evaluating dialogue systems has emerged in recent years (Lowe et al., 2017; Deriu and Cieliebak, 2019; Mehri and Eskenazi, 2020; Deriu et al., 2020). These models are trained to take a context and a candidate response as input and to output a numerical score that rates the candidate for the given context.…”
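
The input/output contract described in the excerpt can be sketched in a few lines of Python. This is only an illustration of the interface: the names `DialogueMetric`, `toy_metric`, and `rank_candidates` are hypothetical, and the scorer is a simple token-overlap stand-in, not one of the trained models from the cited papers.

```python
from typing import Callable, List, Sequence

# Hypothetical alias: a metric maps (context turns, candidate response) to a score.
DialogueMetric = Callable[[Sequence[str], str], float]


def toy_metric(context: Sequence[str], candidate: str) -> float:
    """Stand-in scorer (assumption, NOT a trained model).

    Returns the fraction of candidate tokens that also occur in the context.
    A learned metric would replace this body with a model's forward pass.
    """
    context_tokens = {tok for turn in context for tok in turn.lower().split()}
    candidate_tokens = candidate.lower().split()
    if not candidate_tokens:
        return 0.0
    overlap = sum(tok in context_tokens for tok in candidate_tokens)
    return overlap / len(candidate_tokens)


def rank_candidates(
    metric: DialogueMetric, context: Sequence[str], candidates: Sequence[str]
) -> List[str]:
    """Order candidate responses from highest to lowest score for one context."""
    return sorted(candidates, key=lambda c: metric(context, c), reverse=True)


if __name__ == "__main__":
    context = ["do you like jazz", "yes i love jazz"]
    candidates = ["the weather is nice", "i like jazz too"]
    for response in rank_candidates(toy_metric, context, candidates):
        print(f"{toy_metric(context, response):.2f}  {response}")
```

Any scorer with the same signature, including a trained one, could be passed to `rank_candidates` in place of `toy_metric`; the overlap heuristic is used here only because it runs without a model or training data.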