2020
DOI: 10.1075/ijlcr.20009.sha

Refining and modifying the EFCAMDAT

Abstract: This report outlines the development of a new corpus, which was created by refining and modifying the largest open-access L2 English learner database – the EFCAMDAT. The extensive data-curation process, which can inform the development and use of other corpora, included procedures such as converting the database from XML to a tabular format, and removing problematic markup tags and non-English texts. The final dataset contains two corresponding samples, written by similar learners in response to different prom…
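As an illustration of two of the curation steps named in the abstract (XML-to-tabular conversion and removal of leftover markup tags), the following is a minimal sketch only. The XML layout, the element and attribute names (`writing`, `learner`, `text`, `level`, `nationality`), and the helper names are hypothetical and are not taken from the report; the actual EFCAMDAT schema and the author's procedure may differ. Removal of non-English texts is not covered here.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the real EFCAMDAT XML schema may differ.
SAMPLE_XML = """<writings>
  <writing id="1" level="3">
    <learner id="101" nationality="br"/>
    <text>I like&lt;br/&gt; my city.</text>
  </writing>
  <writing id="2" level="7">
    <learner id="102" nationality="de"/>
    <text>We went to the park.</text>
  </writing>
</writings>"""

FIELDS = ["writing_id", "level", "learner_id", "nationality", "text"]
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Replace leftover markup tags with a space and collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def xml_to_rows(xml_string):
    """Flatten each <writing> element into one dictionary (one table row)."""
    root = ET.fromstring(xml_string)
    rows = []
    for writing in root.iter("writing"):
        learner = writing.find("learner")
        rows.append({
            "writing_id": writing.get("id"),
            "level": writing.get("level"),
            "learner_id": learner.get("id") if learner is not None else None,
            "nationality": learner.get("nationality") if learner is not None else None,
            "text": strip_markup(writing.findtext("text", default="")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
```

Keeping one row per writing, with learner attributes repeated on each row, is one simple way to obtain a flat table that standard statistical tools can read directly.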

Cited by 10 publications (3 citation statements)
References: 8 publications
“…Two L2 written corpora were used in the present study: MERLIN [77] and the EF Cambridge Open Language Database (EFCAMDAT) [78][79][80]. MERLIN includes texts written by L2 Czech, German and Italian speakers, while EFCAMDAT covers texts written by L2 English speakers.…”
Section: Corpus (mentioning)
confidence: 99%
“…As mentioned above, EFCAMDAT essays were all persuasive essays. EFCAMDAT is semi-longitudinal and contains several texts written by the same L2 learner [78]. This study selected only one text per L2 learner to reduce idiosyncratic complexity features.…”
Section: Corpus (mentioning)
confidence: 99%
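The selection step described in the statement above (keeping only one text per L2 learner) could be sketched as follows. This is an illustration under stated assumptions: the citing study does not say how the single text was chosen, so the seeded random choice, the record layout, and the names `one_text_per_learner` and `learner_id` are all hypothetical.

```python
import random


def one_text_per_learner(records, seed=0):
    """Keep a single text per learner.

    Assumption: the text is picked at random with a fixed seed; the citing
    study may have used a different selection rule.
    """
    rng = random.Random(seed)
    by_learner = {}
    for record in records:
        by_learner.setdefault(record["learner_id"], []).append(record)
    return [rng.choice(texts) for texts in by_learner.values()]


# Hypothetical records standing in for rows of a semi-longitudinal corpus.
records = [
    {"learner_id": "101", "text": "First essay by learner 101."},
    {"learner_id": "101", "text": "Second essay by learner 101."},
    {"learner_id": "102", "text": "Only essay by learner 102."},
]

selected = one_text_per_learner(records)
```

Fixing the seed keeps the subsample reproducible across runs.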
“…Firstly, we use the EFCAMDAT corpus (Geertzen et al., 2014) that comprises L2 learners' scripts annotated with their respective score on a scale from 0 to 100, their proficiency level from 1 to 16 (mapped to CEFR levels from A1 to C2) and partially error-tagged by human experts. As our work investigates the efficacy of errors as features, we only use the error-tagged section of the EFCAMDAT Cleaned Subcorpus (Shatz, 2020), consisting of 498,208 scripts ranging from proficiency level 1 to 15 (i.e. from A1 to C1), which we divided into training and test set.…”
Section: Reference To Prior Work (mentioning)
confidence: 99%