2022
DOI: 10.1145/3485766

A Survey of Evaluation Metrics Used for NLG Systems

Abstract: In the last few years, a large number of automatic evaluation metrics have been proposed for evaluating Natural Language Generation (NLG) systems. The rapid development and adoption of such automatic evaluation metrics in a relatively short time has created the need for a survey of these metrics. In this survey, we (i) highlight the challenges in automatically evaluating NLG systems, (ii) propose a coherent taxonomy for organising existing evaluation metrics, (iii) briefly describe different existing metrics, …

Cited by 72 publications (44 citation statements)
References 111 publications
“…For the evaluation, we developed an automated framework and utilized both automatic and human-based rankings. We used popular metrics, such as BLEU [25] and ROUGE [26,27], for automatic evaluation. These metrics are widely used for Natural Language Generation (NLG) tasks, including AQG, as they calculate the n-gram similarity between the reference sentence and the generated questions.…”
Section: Experimental Analysis and Evaluation Results
Citation type: mentioning
confidence: 99%
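The excerpt above notes that BLEU and ROUGE score the n-gram similarity between a reference sentence and a generated question. The following is a minimal sketch of how such scores might be computed with the nltk and rouge_score packages; the example sentences and the smoothing choice are illustrative assumptions, not taken from the cited work.

```python
# Illustrative sketch: n-gram overlap metrics (BLEU, ROUGE) for a generated
# question versus a reference, using nltk and the rouge_score package.
# The sentences and smoothing choice below are assumptions for demonstration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "what metrics are used to evaluate natural language generation"
candidate = "which metrics evaluate natural language generation systems"

# BLEU: modified n-gram precision (up to 4-grams) with a brevity penalty;
# smoothing avoids zero scores on short single-sentence comparisons.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```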
“…Human assessments provide a complete picture of response generating performance, particularly in the generation of humanlike response [60]. We leverage the human evaluation questions from [8], which covers engaging, interesting, humanlike, and knowledgeable, for response generation assessments.…”
Section: Human Evaluations
Citation type: mentioning
confidence: 99%
“…A total of 18 automatic metrics are tested against statistics produced by the human judgements of our criteria: post-edit times, number of incorrect statements, and number of omissions. Following the taxonomies reported by Celikyilmaz et al. (2020) and Sai et al. (2020), the metrics considered can be loosely grouped in:…”
Section: Correlation With Automatic Metrics
Citation type: mentioning
confidence: 99%
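The excerpt above describes testing automatic metrics against statistics derived from human judgements (post-edit times, counts of incorrect statements and omissions). A hedged sketch of the usual correlation analysis follows, assuming per-output metric scores and human statistics are available as parallel lists; the variable names and values are illustrative only, not data from the cited study.

```python
# Illustrative sketch: correlating an automatic metric with a human-derived
# statistic (e.g., post-edit time) across system outputs.
# The numbers below are placeholder values purely to show the computation.
from scipy.stats import pearsonr, spearmanr

metric_scores   = [0.62, 0.48, 0.71, 0.55, 0.40]   # one automatic metric, per output
post_edit_times = [34.0, 51.5, 22.0, 40.0, 63.5]   # seconds of human post-editing

# Pearson measures linear association; Spearman only assumes a monotonic
# relationship, which is often more appropriate for ranking-style metrics.
pearson_r, pearson_p = pearsonr(metric_scores, post_edit_times)
spearman_rho, spearman_p = spearmanr(metric_scores, post_edit_times)

print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.2f} (p = {spearman_p:.3f})")
```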