In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F0.5 of 65.3/66.5 on CoNLL-2014 (test) and an F0.5 of 72.4/73.6 on BEA-2019 (test). Inference is up to 10 times faster than that of a Transformer-based seq2seq GEC system. The code and trained models are publicly available.
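As a rough illustration of the token-level transformation idea, the sketch below applies per-token edit tags (keep, delete, append, replace) to a source sentence. The tag names and decoding loop are illustrative only; the full tag vocabulary described in the paper also includes grammar-specific transformations (e.g., case and verb-form changes) not shown here.

```python
# Minimal sketch of applying token-level edit tags to an input sentence.
# Tag names and format are illustrative, not the exact tag vocabulary.

def apply_tags(tokens, tags):
    """Apply one edit tag per source token and return the corrected tokens."""
    output = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            output.append(token)
        elif tag == "$DELETE":
            continue  # drop the source token
        elif tag.startswith("$APPEND_"):
            # keep the source token and append a new token after it
            output.extend([token, tag[len("$APPEND_"):]])
        elif tag.startswith("$REPLACE_"):
            output.append(tag[len("$REPLACE_"):])
        else:
            output.append(token)  # unknown tag: fall back to keeping the token
    return output

tokens = ["She", "go", "to", "school", "every", "day"]
tags = ["$KEEP", "$REPLACE_goes", "$KEEP", "$KEEP", "$KEEP", "$APPEND_."]
print(" ".join(apply_tags(tokens, tags)))  # She goes to school every day .
```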
Run-on sentences are common grammatical mistakes, but little research has tackled this problem to date. This work introduces two machine learning models for correcting run-on sentences that outperform leading methods for the related tasks of punctuation restoration and whole-sentence grammatical error correction. Because annotated data for this error type is limited, we experiment with artificially generating training data from clean newswire text. Our findings suggest that artificial training data is viable for this task. We discuss implications for correcting run-ons and other types of mistakes that have low coverage in error-annotated corpora.
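One plausible way to synthesize such training pairs from clean text, in the spirit described above but not the authors' exact procedure, is to fuse adjacent sentences and drop or weaken the boundary punctuation; the function name and probabilities below are assumptions for illustration.

```python
# Hedged sketch: build (errorful run-on, clean target) pairs from clean sentences.
import random

def make_run_on(sent_a, sent_b, comma_splice_prob=0.5):
    """Fuse two clean sentences into a run-on or comma splice."""
    joiner = "," if random.random() < comma_splice_prob else ""
    fused = sent_a.rstrip(".!?") + joiner + " " + sent_b[0].lower() + sent_b[1:]
    return fused, sent_a + " " + sent_b  # (errorful source, clean target)

src, tgt = make_run_on("The report was late.", "Nobody noticed until Friday.")
print(src)  # e.g. "The report was late, nobody noticed until Friday."
print(tgt)  # "The report was late. Nobody noticed until Friday."
```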
In this paper, we investigate improvements to the GEC sequence tagging architecture with a focus on ensembling recent cutting-edge Transformer-based encoders in their Large configurations. We encourage ensembling models by majority vote on span-level edits, since this approach is tolerant of differences in model architecture and vocabulary size. Our best ensemble achieves a new SOTA result with an F0.5 score of 76.05 on BEA-2019 (test), even without pre-training on synthetic datasets. In addition, we perform knowledge distillation with a trained ensemble to generate new synthetic training datasets, "Troy-Blogs" and "Troy-1BW". Our best single sequence tagging model, pre-trained on the generated Troy datasets in combination with the publicly available synthetic PIE dataset, achieves a near-SOTA result with an F0.5 score of 73.21 on BEA-2019 (test); to the best of our knowledge, it is surpassed only by the much heavier T5 model (Rothe et al., 2021). The code, datasets, and trained models are publicly available. This research was performed during Maksym Tarnavskyi's work on his M.Sc. thesis at the Ukrainian Catholic University (Tarnavskyi, 2021).
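To make the span-level majority voting concrete, the following sketch assumes each model's output has already been converted to edits of the form (start, end, replacement) against the same source tokens; this edit representation, the helper name, and the vote threshold are assumptions for illustration rather than the authors' exact setup.

```python
# Illustrative majority vote over span-level edits proposed by several models.
from collections import Counter

def majority_vote(edit_sets, min_votes):
    """Keep only edits proposed by at least `min_votes` of the models."""
    votes = Counter(edit for edits in edit_sets for edit in set(edits))
    kept = [edit for edit, count in votes.items() if count >= min_votes]
    return sorted(kept)  # sort by span start for deterministic application

model_a = [(1, 2, "goes"), (5, 5, ".")]
model_b = [(1, 2, "goes")]
model_c = [(1, 2, "goes"), (3, 4, "the")]
print(majority_vote([model_a, model_b, model_c], min_votes=2))
# [(1, 2, 'goes')]
```

Because voting happens on edits rather than on output strings or tag distributions, models with different encoders, tokenizers, or tag vocabularies can be combined without any alignment of their internals.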