Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop 2020
DOI: 10.18653/v1/2020.acl-srw.31
Building a Japanese Typo Dataset from Wikipedia’s Revision History

Abstract: User-generated texts contain many typos that must be corrected for NLP systems to work well. Although a large number of typo-correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo-correction pairs from Wikipedia's revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented, so we cannot simply apply a spelling checker, and (2) …
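The extraction idea can be pictured with a minimal sketch: given a sentence from an old revision and its aligned counterpart from the newer revision, a character-level diff exposes small edit spans that are plausible typo fixes. This is an illustration only, not the authors' actual pipeline; the function name, the use of difflib, and the span-length threshold are all assumptions.

```python
import difflib

def extract_edit_pairs(old_sent: str, new_sent: str, max_span: int = 3):
    """Yield (typo, correction) substring pairs where two aligned
    sentences differ only by short character-level replacements."""
    matcher = difflib.SequenceMatcher(a=old_sent, b=new_sent)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Short 'replace' spans are plausible typo fixes; long spans
        # are more likely genuine rewrites, so we skip them.
        if tag == "replace" and max(i2 - i1, j2 - j1) <= max_span:
            yield old_sent[i1:i2], new_sent[j1:j2]

# Example: 解折 (a Kanji typo) corrected to 解析 in a later revision.
print(list(extract_edit_pairs("文章を解折する", "文章を解析する")))
# -> [('折', '析')]
```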

Cited by 4 publications (3 citation statements) · References 10 publications
“…Previous studies successfully used the Levenshtein distance to extract misspelling-correction pairs from GitHub's commit logs (Hagiwara and Mita, 2020) and Wikipedia's revision history (Tanaka et al., 2020). Although this may seem to contradict our findings, these successes are reasonable because the text domains explored in those studies are substantially different from search query logs (Appendix E).…”
Section: Related Work (contrasting)
confidence: 61%
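As a concrete picture of the Levenshtein-distance filter these citing papers describe, a minimal sketch follows. The plain dynamic-programming implementation and the distance threshold are illustrative assumptions, not taken from either paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_typo_candidate(old: str, new: str, max_dist: int = 2) -> bool:
    """Keep only small, nonzero edits; larger distances usually mean
    rewrites, not typo fixes. The threshold is illustrative."""
    return 0 < levenshtein(old, new) <= max_dist
```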
“…We leverage these edits to construct error-correction word dictionaries (later used to create noisy test data). Our approach to mining edits is similar to Tanaka et al. (2020), but we consider multiple languages (as opposed to only Japanese) and additionally create dictionaries of word-level edits.…”
Section: Wiki-edit Mining (mentioning)
confidence: 99%
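To make the dictionary-building step concrete, here is a minimal sketch of aggregating mined (error, correction) word pairs into a lookup table. The function name and the frequency cutoff are hypothetical; the citing paper does not specify this implementation.

```python
from collections import Counter, defaultdict

def build_error_dictionary(edit_pairs, min_count: int = 2):
    """Aggregate mined (error, correction) word pairs into a dictionary
    mapping each error to its most frequent correction. min_count
    filters one-off edits that are likely noise (an assumed threshold)."""
    counts = defaultdict(Counter)
    for error, correction in edit_pairs:
        counts[error][correction] += 1
    return {
        err: corr.most_common(1)[0][0]
        for err, corr in counts.items()
        if sum(corr.values()) >= min_count
    }

# e.g. build_error_dictionary([("recieve", "receive"), ("recieve", "receive")])
# -> {"recieve": "receive"}
```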
“…Finally, for Kanji we re-use the criteria of Tanaka et al. (2020), as we re-use their dataset of sentence pairs: checking whether the two sentences (containing Kanji) have the same reading. Table 9 lists the number of correct/incorrect pairs (in millions) used for noise dictionaries to create the test sets for the various languages (§3).…”
Section: B Chinese and Japanese Edit Mining (mentioning)
confidence: 99%
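The same-reading criterion can be sketched as follows. Using pykakasi as the reading converter is an assumption for illustration; Tanaka et al. (2020) may rely on a different morphological analyzer or reading dictionary, and readings depend on the converter's dictionary.

```python
import pykakasi  # assumed reading converter; the paper's actual tool may differ

_kks = pykakasi.kakasi()

def reading(text: str) -> str:
    """Concatenate the hiragana reading of each token (pykakasi 2.x API)."""
    return "".join(item["hira"] for item in _kks.convert(text))

def same_reading(sent_a: str, sent_b: str) -> bool:
    """True if both sentences read identically, suggesting a
    Kanji-conversion typo rather than a content change."""
    return reading(sent_a) == reading(sent_b)

# e.g. 伯母 and 叔母 both read おば, so a swap between them is
# flagged as a same-reading (conversion) edit:
# same_reading("彼は伯母に会った", "彼は叔母に会った")  # -> True
```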