Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications 2019
DOI: 10.18653/v1/w19-4449

The Unbearable Weight of Generating Artificial Errors for Grammatical Error Correction

Abstract: In recent years, sequence-to-sequence models have been very effective for end-to-end grammatical error correction (GEC). As creating a human-annotated parallel corpus for GEC is expensive and time-consuming, there has been work on artificial corpus generation with the aim of creating sentences that contain realistic grammatical errors from grammatically correct sentences. In this paper, we investigate the impact of using recent neural models for generating errors to help neural models to correct errors. We condu…

Cited by 7 publications (3 citation statements) · References: 10 publications
“…We discuss available datasets in Section 4.1, but it is important to note the role of synthetic data generation for GEC model training. Synthetic data has been used for GEC for a long time (Foster and Andersen, 2009; Brockett et al., 2006), and recent research shows that it can lead to significant performance gains (Stahlberg and Kumar, 2021; Htut and Tetreault, 2019). Approaches for synthetic data generation include character perturbations, dictionary- or edit-distance-based replacements, shuffling word order, rule-based suffix transformations, and more (Grundkiewicz et al., 2019; Awasthi et al., 2019a; Náplava and Straka, 2019; Rothe et al., 2021b).…”
Section: Related Work
confidence: 99%
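The approach families named in this quote are easy to prototype. Below is a minimal, illustrative Python sketch of rule-based error injection (character perturbation, suffix rewriting, and local word-order shuffling) applied to clean sentences; the specific rules, probabilities, and function names are hypothetical choices for exposition, not the procedures of the cited papers.

```python
import random

# Illustrative sketch only: rules and probabilities are hypothetical,
# not taken from Grundkiewicz et al. (2019) or the other cited works.
SUFFIX_RULES = [("ing", "e"), ("ies", "y"), ("s", "")]  # crude inflection errors

def perturb_chars(word: str, rng: random.Random) -> str:
    """Delete, duplicate, or swap one character in the word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "duplicate", "swap"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "duplicate":
        return word[:i + 1] + word[i] + word[i + 1:]
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # swap neighbors

def corrupt(sentence: str, rng: random.Random, p: float = 0.15) -> str:
    """Inject artificial errors into a grammatically correct sentence."""
    out = []
    for w in sentence.split():
        r = rng.random()
        if r < p / 3:                      # character-level noise
            w = perturb_chars(w, rng)
        elif r < 2 * p / 3:                # rule-based suffix transformation
            for suffix, repl in SUFFIX_RULES:
                if w.endswith(suffix):
                    w = w[: -len(suffix)] + repl
                    break
        out.append(w)
    if rng.random() < p and len(out) > 2:  # local word-order shuffle
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(out)

rng = random.Random(0)
print(corrupt("The children are playing in the garden .", rng))
```

A generator like this is applied to a large monolingual corpus of clean sentences to produce (noisy, clean) pairs for pretraining a GEC model; the neural generation methods studied in the paper above replace these hand-written rules with learned error models.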
“…Training GEC models is difficult due to the natural lack of suitable training data and possible erroneous corrections, so synthetic data becomes a crucial part of any GEC pipeline (Choe et al., 2019; Stahlberg and Kumar, 2021; Htut and Tetreault, 2019). It had been used for GEC even before the deep learning era that required larger datasets (Foster and Andersen, 2009; Brockett et al., 2006).…”
Section: Synthetic Data
confidence: 99%
“…Experimental results showed that ungrammatical sentences generated by multi-layer CNN and Transformer models were more beneficial for training GEC systems. However, adding too much artificial data can hurt performance [122], so it is important to oversample the authentic training data to maintain the balance [91].…”
Section: Back Translation
confidence: 99%
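To make the oversampling point concrete, here is a small Python sketch of mixing authentic and synthetic pairs at a fixed ratio by duplicating the authentic data; the function name, the `authentic_ratio` value, and the toy sentence pairs are hypothetical placeholders, since the cited works tune this balance empirically.

```python
import random

def mix_corpora(authentic, synthetic, authentic_ratio=0.3, seed=0):
    """Return a shuffled training set in which authentic pairs make up
    roughly `authentic_ratio` of the examples, oversampling (duplicating)
    the authentic pairs as needed. Ratio value is illustrative only."""
    rng = random.Random(seed)
    # Number of authentic examples needed to hit the target ratio.
    target = int(authentic_ratio * len(synthetic) / (1 - authentic_ratio))
    reps = max(1, -(-target // len(authentic)))  # ceiling division
    mixed = synthetic + (authentic * reps)[:target]
    rng.shuffle(mixed)
    return mixed

# Toy (source, correction) pairs, purely for demonstration.
authentic = [("She go home .", "She goes home .")] * 100
synthetic = [("He eat apple .", "He eats an apple .")] * 10000
train = mix_corpora(authentic, synthetic)
print(len(train))  # authentic pairs are ~30% of the mixed set
```

Without this duplication, a small authentic corpus would be drowned out by the much larger back-translated one, which is the imbalance the quoted passage warns about.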