Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.26

A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction

Abstract: Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected. To address this, we designed a self-refinement method where the key idea is to denoise these datasets by …
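The truncated abstract describes denoising GEC training data with a model's own corrections. As a rough illustration of that idea only, the sketch below re-corrects each noisy training pair with an already-trained GEC model and keeps the refined target when a language-model score prefers it; `gec_model.correct` and `lm_score` are hypothetical placeholders, not the authors' actual interface.

```python
# Minimal sketch of self-refinement denoising (hypothetical API, not the
# authors' released code): re-correct each training target with a trained
# GEC model and keep the refinement when a language model prefers it.

def refine_dataset(pairs, gec_model, lm_score):
    """pairs: list of (source, target) strings from a noisy GEC corpus."""
    refined = []
    for source, target in pairs:
        candidate = gec_model.correct(source)  # model's own correction
        # Keep whichever target the LM scores as more fluent/grammatical.
        if lm_score(candidate) > lm_score(target):
            refined.append((source, candidate))
        else:
            refined.append((source, target))
    return refined
```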

Cited by 9 publications (7 citation statements) · References 39 publications
“…A common problem in GEC is that the largest publicly available high-quality parallel corpora contain only roughly 50k sentence pairs, and larger corpora, such as Lang-8, are noisy (Mita et al. 2020; Rothe et al. 2021). This data sparsity problem has motivated a great deal of research into synthetic data generation, especially in the context of resource-heavy NMT approaches, because synthetic data primarily requires a native monolingual source corpus rather than a labour-intensive manual annotation process.…”
Section: Data Augmentation (mentioning)
confidence: 99%
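Since the quoted passage hinges on synthetic data being generated from a native monolingual corpus alone, a toy sketch may make the workflow concrete. The noising scheme below (random token drops and adjacent swaps) is a generic illustration, not the specific corruption method of any cited paper.

```python
import random

# Toy synthetic-data generator for GEC (illustrative only): corrupt clean
# native-speaker sentences to obtain (noisy source, clean target) pairs.
def corrupt(sentence, p_drop=0.1, p_swap=0.1, seed=None):
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if rng.random() < p_drop:  # simulate a missing word
            continue
        out.append(tok)
    # Occasionally swap adjacent tokens to simulate word-order errors.
    for i in range(len(out) - 1):
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(out)

clean = "This approach requires only a monolingual corpus ."
pair = (corrupt(clean, seed=0), clean)  # (synthetic source, target)
```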
“…One direction focuses on correcting noisy sentences. Mita et al. (2020) and Rothe et al. (2021) achieve this by incorporating a well-trained GEC model to reduce wrong corrections. The other direction attempts to down-weight noisy sentences.…”
Section: Augmenting Official Datasets (mentioning)
confidence: 99%
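For the second direction mentioned in the quote, down-weighting noisy sentences, a common generic recipe is to scale each example's training loss by a data-quality score. A minimal PyTorch-style sketch, assuming the per-sentence `quality` scores come from some external noise estimator (a hypothetical setup, not a specific cited method):

```python
import torch

# Sketch of per-example loss down-weighting (generic recipe, not a specific
# cited method). logits: (batch, seq, vocab); targets: (batch, seq);
# quality: (batch,) scores in [0, 1] from some external noise estimator.
def weighted_gec_loss(logits, targets, quality, pad_id=0):
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=pad_id, reduction="none")
    token_loss = loss_fn(logits.transpose(1, 2), targets)  # (batch, seq)
    # Average over non-pad tokens, then scale each sentence by its quality.
    lengths = (targets != pad_id).sum(dim=1).clamp(min=1)
    sent_loss = token_loss.sum(dim=1) / lengths
    return (quality * sent_loss).mean()  # noisy pairs contribute less
```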
“…Therefore, it is necessary to filter out the error-free spans. Mita et al. (2020) compare the perplexities of generated sentences and correct sentences to determine whether the generated sentences are grammatically correct. However, since sentence-level perplexity aggregates the contributions of many tokens, a sentence with higher perplexity may still be grammatically correct.…”
Section: Erroneous Sentence Construction (mentioning)
confidence: 99%
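Because the criticism in this quote turns on how sentence-level perplexity is computed, a concrete scorer may help. The sketch below uses Hugging Face's GPT-2 purely as an example language model (the cited work may score with a different LM) and keeps a generated sentence only if it is less fluent than its reference:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Example perplexity scorer (GPT-2 chosen for illustration only).
# Lower perplexity ~ more fluent/grammatical under the LM.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

generated = "He go to school yesterday ."
correct = "He went to school yesterday ."
# Keep the generated sentence as an 'erroneous' source only if the LM
# actually finds it less fluent than the reference sentence.
is_erroneous = perplexity(generated) > perplexity(correct)
```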
“…Key Words: Grammatical Error Correction, Evaluation Corpus, Error Tag, Japanese. 1 Introduction. Grammatical error correction is the task of correcting the grammatical errors in a given text into grammatically correct expressions. It mainly targets text written by language learners and is one of the major tasks in educational applications of natural language processing. Approaches developed to date include rule-based methods (Schneider and McCoy 1998), language-model-based methods (Gamon et al. 2008), and classifier-based methods (Dahlmeier and Ng 2011). In recent years, machine-translation-based methods (Brockett et al. 2006) have been actively studied (Chollampatt and Ng 2018; Junczys-Dowmunt et al. 2018; Zhao et al. 2019; Lichtarge et al. 2019; Kiyono et al. 2020; Kaneko et al. 2020; Rothe et al. 2021; Yasunaga et al. 2021; Lai et al. 2022). Specifically, in addition to the CoNLL-2014 shared task evaluation corpus, FCE (Yannakoudakis et al. 2011), JFLEG (Napoles et al. 2017), W&I+LOCNESS (Granger 1998; Yannakoudakis et al. 2018), GMEG (Napoles et … (Xie et al. 2018; Ge et al. 2018a; Zhao et al. 2019; Lichtarge et al. 2019, 2020; Kiyono et al. 2020; Wang and Zheng 2020; Zhou et al. 2020; Wan et al. 2020; Stahlberg and Kumar 2021; Yasunaga et al. 2021; …”
unclassified