This work introduces an automatic classification system for measuring the complexity level of a given Italian text from a linguistic point of view. The task of measuring the complexity of a text is cast as a supervised classification problem by exploiting a dataset of texts purposely produced by linguistic experts for second language teaching and assessment. The widely adopted Common European Framework of Reference for Languages (CEFR) levels were used as target classes, texts were processed by computing a large set of numeric linguistic features, and an experimental comparison among ten widely used machine learning models was conducted. The results show that the proposed approach achieves good prediction accuracy; a further analysis was conducted to identify the categories of features that influenced the predictions.
The selection of texts for second language learning purposes typically relies on teachers' and test developers' individual judgment of the observable qualitative properties of a text. Little or no consideration is generally given to the quantitative dimension within an evidence-based framework of reproducibility. This study aims to fill the gap by evaluating the effectiveness of an automatic tool trained to assess text complexity in the context of learning Italian as a second language. A dataset of texts labeled by expert test developers was used to evaluate the performance of three classifier models (decision tree, random forest, and support vector machine), which were trained on quantitatively measured linguistic features extracted from the texts. The experimental analysis provided satisfactory results, also with respect to which kinds of linguistic traits contributed the most to the final outcome.
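To make "linguistic features measured quantitatively" concrete, here is a minimal sketch of what extracting a few such features from a text might look like. The three features below (average sentence length, average word length, type-token ratio) and the function name `extract_features` are illustrative assumptions; the abstract does not state which features the study actually used, and the tokenization here is deliberately naive.

```python
import re
from typing import Dict


def extract_features(text: str) -> Dict[str, float]:
    """Compute three simple surface features of a text (illustrative only)."""
    # Naive sentence split on runs of '.', '!' or '?'; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenization: lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    if not sentences or not tokens:
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "type_token_ratio": 0.0,
        }
    return {
        # Mean number of tokens per sentence.
        "avg_sentence_length": len(tokens) / len(sentences),
        # Mean number of characters per token.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Distinct tokens divided by total tokens (lexical variety).
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


if __name__ == "__main__":
    print(extract_features("Il gatto dorme. Il cane corre."))
```

Feature vectors of this kind, paired with expert-assigned labels, are what a classifier such as a decision tree, random forest, or support vector machine would be trained on.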
The CELI corpus: a new resource for studying Italian L2 acquisition
The article illustrates the design, the development and the characteristics of a new learner corpus of Italian L2: the CELI corpus. The corpus systematically collects the written texts produced by learners of Italian L2 who have passed the CELI exams administered by the University for Foreigners of Perugia at proficiency levels B1, B2, C1 and C2. The corpus contains 3041 texts produced by the same number of learners, with a balanced distribution of tokens across proficiency levels. The metadata associated with each text include: gender, date of birth, student ID code and nationality of the learner; CEFR proficiency level, related to the exam passed by the learner; the score assigned to the entire exam, to the entire written component of the exam, and to the single written task, together with scores pertaining to lexical, grammatical and sociolinguistic competence and to the coherence and cohesion of the produced text; the ID number of the task used to produce each text, with the indication of text genre (letter, e-mail, blog, story, article and report) and text type (argumentative, descriptive and narrative, or mixed: descriptive-narrative; argumentative-narrative; argumentative-descriptive; argumentative-narrative-descriptive). The CELI corpus lends itself to numerous uses both in the domain of second language acquisition research, particularly with regard to pseudo-longitudinal research designs, and in the domain of pedagogical planning, pedagogical materials design and language testing.
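The metadata listed above could be represented as one record per text. The following is a minimal sketch under stated assumptions: the class name `CeliTextRecord`, all field names, and the field types are hypothetical and are not taken from the corpus's actual distribution format, which the abstract does not describe.

```python
from dataclasses import dataclass

# Proficiency levels covered by the corpus, as stated in the abstract.
CEFR_LEVELS = {"B1", "B2", "C1", "C2"}


@dataclass
class CeliTextRecord:
    """One learner text with its metadata (hypothetical field names)."""
    # Learner metadata
    student_id: str
    gender: str
    birth_date: str
    nationality: str
    # Level of the exam the learner passed
    cefr_level: str
    # Scores: whole exam, written component, and this written task
    exam_score: float
    written_score: float
    task_score: float
    lexical_score: float
    grammatical_score: float
    sociolinguistic_score: float
    coherence_cohesion_score: float
    # Task metadata
    task_id: str
    genre: str       # e.g. "letter", "e-mail", "blog", "story", "article", "report"
    text_type: str   # e.g. "argumentative", "descriptive-narrative"
    # The learner's written production
    text: str

    def __post_init__(self) -> None:
        # Reject levels outside the four the corpus covers.
        if self.cefr_level not in CEFR_LEVELS:
            raise ValueError(f"unexpected CEFR level: {self.cefr_level!r}")
```

Grouping records by `cefr_level` is the kind of operation a pseudo-longitudinal comparison across proficiency levels would start from.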
Learning to pronounce a written word implies assigning a stress pattern to that word. This task can present a challenge for speakers of languages like Italian, in which stress information must often be computed from distributional properties of the language, especially for individuals learning Italian as a second language (L2). Here, we aimed to characterize the processes underlying the development of stress assignment in native English and native Chinese speakers learning L2 Italian. Both types of bilinguals produced evidence supporting a role of vocabulary size in modulating the type of distributional information used in stress assignment, with an early bias for Italian's dominant stress pattern being gradually replaced by use of associations between orthographic sequences and stress patterns in more advanced bilinguals. We also obtained some evidence for a transfer of stress assignment habits from the bilinguals’ native language to Italian, although only in English native speakers.