We present a new multi-layered annotation scheme for orthographic errors in freely written German texts produced by primary school children. The scheme is closely linked to the German graphematic system and defines categories for both general structural word properties and error-related properties. Furthermore, it features multiple layers of information which can be used to evaluate an error. The categories can also be used to investigate properties of correctly spelled words, and to compare them to the erroneous spellings. For data representation, we propose the XML format LearnerXML.
This paper presents the automatic annotation of orthographic properties of German words and spelling errors in texts of German primary school children according to a new multi-layered annotation scheme [1]. The scheme is closely linked to the principles of the German writing system and is intended to enable new research questions concerning the relationship between spelling errors of competent and less competent spellers and the regularities of the German graphematic system. A novelty of the automatic annotation is that it takes an intended, correctly spelled word as input and applies a set of rules to generate a list of error candidates containing systematic spelling errors. As a further novelty, the annotation of additional word- and error-related properties is presented, such as whether the spelling error changes the word's pronunciation and whether a spelling can be derived from a related word form. This enables more detailed analyses of the errors and also allows us to develop an application for learners that generates automatic advice for the correct spelling. A first evaluation shows that the automatic annotation of the presented categories and features can come close to human annotations.
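The rule-based idea described in this abstract (start from the intended word, apply rules, collect systematic misspellings) can be illustrated with a minimal sketch. The four rules, their names, and the function `generate_candidates` are hypothetical simplifications chosen for illustration; they are not the rule set or the category labels of the annotation scheme itself.

```python
import re
from typing import Callable, Dict, List, Tuple


def drop_doubled_consonant(word: str) -> List[str]:
    """Reduce one doubled consonant to a single letter (kommen -> komen)."""
    return [
        word[:m.start()] + m.group(1) + word[m.end():]
        for m in re.finditer(r"([bdfgklmnprst])\1", word)
    ]


def drop_lengthening_h(word: str) -> List[str]:
    """Omit an h between a vowel and l, m, n, or r (fahren -> faren)."""
    return [
        word[:m.start()] + word[m.end():]
        for m in re.finditer(r"(?<=[aeiouäöü])h(?=[lmnr])", word)
    ]


def ie_to_i(word: str) -> List[str]:
    """Write <i> instead of <ie> (Liebe -> Libe)."""
    return [
        word[:m.start()] + "i" + word[m.end():]
        for m in re.finditer("ie", word)
    ]


def final_devoicing(word: str) -> List[str]:
    """Spell a word-final voiced stop as it is pronounced (Hund -> Hunt)."""
    mapping = {"b": "p", "d": "t", "g": "k"}
    if word and word[-1] in mapping:
        return [word[:-1] + mapping[word[-1]]]
    return []


# Hypothetical category names; each rule maps a target word to misspellings.
RULES: Dict[str, Callable[[str], List[str]]] = {
    "consonant_doubling": drop_doubled_consonant,
    "lengthening_h": drop_lengthening_h,
    "ie_as_i": ie_to_i,
    "final_devoicing": final_devoicing,
}


def generate_candidates(target: str) -> List[Tuple[str, str]]:
    """Return (candidate misspelling, rule name) pairs for a target word."""
    return [
        (candidate, name)
        for name, rule in RULES.items()
        for candidate in rule(target)
    ]


if __name__ == "__main__":
    for target in ["kommen", "fahren", "Liebe", "Hund"]:
        print(target, generate_candidates(target))
```

An observed misspelling could then be annotated by looking it up among the candidates generated for its target word. A real system would need many more rules and would have to handle several errors in one word.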
NLP applications for learners often rely on annotated learner corpora. It is therefore important that the annotations are meaningful for the task as well as consistent and reliable. We present a new longitudinal L1 learner corpus for German (handwritten texts collected in grades 2-4), which is transcribed and annotated with a target hypothesis that strictly corrects only orthographic errors, and is thereby tailored to research and tool development for orthographic issues in primary school. While transcription and target hypothesis are not evaluated for most corpora, we conducted a detailed inter-annotator agreement study for both tasks. Although we achieved high agreement, our discussion of cases of disagreement shows that even with detailed guidelines, annotators occasionally differ for various reasons. This should also be considered when working with transcriptions and target hypotheses of other corpora, especially if no explicit guidelines for their construction are known.
This paper proposes a tool for the automatic analysis of spelling errors in freely written German texts. It is based on automatic annotations of spelling errors that comprise various levels, such as linguistic properties of the target word (phonemes, syllables, morphemes) and error-related properties such as error categories which mark whether the misspelling changes the pronunciation of a word or whether the correct spelling can be derived from a related word form. These can be used to create an application that could, for example, help teachers analyze their students' orthographic skills and give feedback with little manual effort. In the future, it could also be implemented as an automatic tutoring system for children, in which case the interface has to be child-oriented and should present error analysis as a kind of game. While the paper presents the capabilities of a first prototype, the concrete implementation for real-world use is open for discussion with experts on orthography instruction.
Compared to early language development, later changes to the language system during orthography and literacy acquisition have not yet been researched in detail. We present a longitudinal corpus of texts on short picture stories written by German primary school children between grades 2 and 4 and grades 3 and 4. It includes 1,922 texts with 212,505 tokens (6,364 types) from 251 children. For each text, rich metadata is available, including age, grade and linguistic background (at least 60% of the children were multilingual). To our knowledge, our corpus is the largest longitudinal corpus of written texts by children at primary school age. Each word is included in its original spelling as well as in a normalized form (target hypothesis), specifying the intended word form, which we corrected for orthographic but not grammatical errors. Original and target word forms are aligned character-wise and the target word forms are enriched with phonological, syllabic, and morphological information. Additionally, for each target word form, we established key lexical variables, e.g., word frequency or summed bigram frequency, as specified in childLex. Where applicable, we also specify key features of German orthography (e.g., consonant doubling, vowel-lengthening