Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.742
GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems

Abstract: Automatically evaluating dialogue coherence is a challenging but high-demand ability for developing high-quality open-domain dialogue systems. However, current evaluation metrics consider only surface features or utterance-level semantics, without explicitly considering the fine-grained topic transition dynamics of dialogue flows. Here, we first consider that the graph structure constituted with topics in a dialogue can accurately depict the underlying communication logic, which is a more natural way to produce…

Cited by 38 publications (42 citation statements)
References 21 publications
“…Moreover, learnable metrics encoding semantic information have been attracting interest recently; they are trained either in a supervised manner with large-scale human-annotated data, such as ADEM (Lowe et al., 2017), or in an unsupervised manner with automatically constructed data, such as RUBER (Tao et al., 2018) and BERT-RUBER (Ghazarian et al., 2019). Furthermore, the recently proposed coherence metric GRADE (Huang et al., 2020) introduces the graph information of dialogue topic transitions and achieves the current state-of-the-art results. Note that these learnable metrics are trained with a two-level objective that separates coherent dialogues from incoherent ones, whereas our QuantiDCE models the task in a multi-level setting that is closer to actual human rating.…”
Section: Related Work (mentioning, confidence: 90%)
“…Therefore, in this work, we set the number of coherence levels L = 3, where the pairs containing the random responses, the adversarial responses, and the reference responses belong to levels 1 to 3, respectively. As for the fine-tuning data, we use the DailyDialog human-judgement dataset, denoted as DailyDialogEVAL, a subset of the adopted evaluation benchmark (Huang et al., 2020) with 300 human-rated examples in total, and randomly split the data into training (90%) and validation (10%) sets. Implementation Details.…”
Section: Methods (mentioning, confidence: 99%)
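The fine-tuning setup described in the excerpt above (three coherence levels for random, adversarial, and reference responses; 300 human-rated DailyDialogEVAL examples; a random 90%/10% train/validation split) can be sketched as follows. This is an illustrative sketch only: the field names and the random level assignment are hypothetical, not taken from the cited work.

```python
import random

# Three coherence levels per the excerpt: 1 = random response,
# 2 = adversarial response, 3 = reference response.
LEVELS = {"random": 1, "adversarial": 2, "reference": 3}

random.seed(0)
# 300 human-rated items, matching the size of the DailyDialogEVAL subset.
# The response types here are sampled at random purely for illustration.
data = [{"id": i, "level": LEVELS[t]}
        for i, t in enumerate(random.choices(list(LEVELS), k=300))]

# Random 90% / 10% split into training and validation sets.
random.shuffle(data)
cut = int(0.9 * len(data))
train, val = data[:cut], data[cut:]
```

With 300 items, this yields 270 training and 30 validation examples, as implied by the 90%/10% split in the excerpt.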
“…Context-Aware NMT. In a sense, chat MT can be viewed as a special case of context-aware MT, which has many related studies (Gong et al., 2011; Jean et al., 2017; Wang et al., 2017b; Zheng et al., 2020; Yang et al., 2019; Kang et al., 2020; Ma et al., 2020). Typically, they extend conventional NMT models to exploit the context.…”
Section: Related Work (mentioning, confidence: 99%)
“…We follow previous work (Wang et al., 2017; Xu et al., 2019; Huang et al., 2020) to optimize the utterance-pair coherence scoring model (described in Section 3.2) with a marginal ranking loss. Formally, the coherence scoring model CS receives two utterances (u1, u2) as input and returns the coherence score c = CS(u1, u2), which reflects the topical relevance of this pair of utterances.…”
Section: Training Data for Coherence Scoring (mentioning, confidence: 99%)
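The excerpt above trains a coherence scorer CS(u1, u2) with a marginal ranking loss. A minimal sketch of that loss follows; the `coherence_score` function below is a hypothetical cosine-similarity stand-in for the learned scorer in the cited work, used only so the loss has something to rank.

```python
import math

def coherence_score(u1, u2):
    """Hypothetical stand-in for CS(u1, u2): cosine similarity of
    utterance embeddings. The cited work learns this scorer instead."""
    dot = sum(a * b for a, b in zip(u1, u2))
    norm = (math.sqrt(sum(a * a for a in u1))
            * math.sqrt(sum(b * b for b in u2)))
    return dot / norm if norm else 0.0

def margin_ranking_loss(score_pos, score_neg, margin=0.5):
    """Marginal (margin) ranking loss: zero once the coherent pair
    outscores the incoherent pair by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))

# A topically related pair should outscore an unrelated one, so
# training pushes this loss toward zero.
context = [1.0, 0.0]
loss = margin_ranking_loss(
    coherence_score(context, [0.9, 0.1]),   # coherent (positive) pair
    coherence_score(context, [0.0, 1.0]),   # incoherent (negative) pair
)
```

The margin value of 0.5 is an arbitrary illustration; the hinge form `max(0, margin - (pos - neg))` is the standard pairwise ranking loss the excerpt refers to.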