This article analyzes schoolchildren's writing across the whole primary school period using sequences of morphological labels (n-grams). We analyzed the bigrams and trigrams in a group of literary texts written by Catalan schoolchildren in order to identify which of them help discriminate between texts from the three cycles into which the Spanish primary education system is divided: lower cycle (6- and 7-year-olds), middle cycle (8- and 9-year-olds) and upper cycle (10- and 11-year-olds). The rate of correct classifications is close to 70% (77.5% for bigrams and 68.6% for trigrams), which makes this technique useful for automatic document classification by age.
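As a rough illustration of the kind of features described above, the following Python sketch counts bigrams and trigrams over a sequence of morphological labels and converts them to relative frequencies. The tag names, the function names `tag_ngrams` and `relative_frequencies`, and the toy sentence are assumptions made for this example; the abstract does not specify the tagset or the tooling the authors used.

```python
from collections import Counter


def tag_ngrams(tags, n):
    """Return every contiguous n-gram of a tag sequence as a tuple."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequencies(tags, n):
    """Map each n-gram of tags to its share of all n-grams in the text."""
    grams = tag_ngrams(tags, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


# Hypothetical tag sequence for a short sentence (the tagset is an assumption).
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]

bigram_freqs = relative_frequencies(tags, 2)
trigram_freqs = relative_frequencies(tags, 3)
```

Frequency vectors of this kind, computed per text, could then be passed to a classifier to separate the three cycles; which classifier the study used is not stated in this abstract.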
Abstract. Legal disputes over alleged plagiarism require the expertise of a forensic linguist. Studies in plagiarism detection have established a maximum threshold of 50% lexical similarity for texts produced independently. This article investigates whether newspaper articles require a threshold of their own, since they start from the same informational content ("what", "who", "when", "where", "how" and "why"). To this end, four quantitative linguistic variables are applied to two corpora structured around 10 topics: a study corpus of 50 articles and a case corpus of 20 texts from a real case. From the study corpus, thresholds are extracted for each variable that reflect the percentages of overlap to be expected between independent texts. These thresholds are then applied to the case corpus to determine whether they make it possible to detect all instances of plagiarism.
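To make the idea of a similarity threshold concrete, here is a minimal Python sketch that measures the percentage of shared word types between two texts and compares it with a threshold. The overlap measure (shared types divided by all types), the function names, and the sample sentences are assumptions for illustration only; the abstract does not say how the four variables are computed.

```python
import re


def word_types(text):
    """Return the set of lowercased word types in a text."""
    return set(re.findall(r"\w+", text.lower()))


def lexical_overlap(text_a, text_b):
    """Percentage of word types shared by two texts (shared / all types)."""
    a, b = word_types(text_a), word_types(text_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


def exceeds_threshold(text_a, text_b, threshold=50.0):
    """True when the overlap is strictly above the threshold percentage."""
    return lexical_overlap(text_a, text_b) > threshold


# Toy example: 2 shared types out of 4 distinct types overall.
overlap = lexical_overlap("the cat sat", "the cat ran")
flagged = exceeds_threshold("the cat sat", "the cat ran")
```

The default of 50.0 mirrors the general threshold mentioned in the abstract; a genre-specific threshold for newspaper articles would be passed in as the `threshold` argument.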
The objective of this study is to characterize writing samples in Catalan written by boys and girls in primary school (from seven to 12 years old) using syntactic patterns. The corpus contains 169 writings divided by sex (76 boys and 93 girls) with an average of 200 words and a total length of 33,763 words. From this corpus, we extracted the 40 most frequent n-grams of morphological categories (bigrams and trigrams). The data were statistically analysed using ANOVA and Linear Discriminant Analysis, and the accuracy in predicting the writer's gender in a cross-validation experiment was 60.4% using both bigrams and trigrams. When the children's age was taken into account, accuracy was higher than 70% in both the original classification and the cross-validation. Identifying the most discriminating bigrams and trigrams allowed us to determine that girls show greater expressive capacity, superior syntactic maturity, and greater lexical and syntactic richness.
This article presents a forensic linguistics case study on determining the level of a B1 English multiple-choice test that was challenged in court by numerous candidates on the grounds that it was not of the appropriate level. A control corpus comprising 240 analogous multiple-choice questions from B1 exams aligned with the Common European Framework of Reference for Languages (CEFR) was compiled in order to establish a threshold for the percentage of questions of a level higher than that being tested which can be expected in such exams. The analysis was carried out following a combination of qualitative and quantitative methods, with the help of the tool English Profile, which provides Reference Level Descriptions (RLDs) for the English language within the CEFR. The results of the analysis of the control corpus established a baseline of 5 to 7% of questions that include key items classified as higher than B1, while the percentage was 68% in the case of the disputed exam. Thus, the present study proposes a further application of the tool English Profile within the field of forensic linguistics and puts forward the concept of Level Appropriateness Threshold (LAT), analogous to other thresholds established in forensic linguistics, which can serve as a baseline for determining the appropriateness of B1 English multiple-choice exams and a model for other levels and skill areas.