The current study presents the Czech General Service List (CGSL), which was designed to capture the core vocabulary of written and spoken Czech that is useful to Czech-as-a-second-language learners (CSLLs). The CGSL is the result of a robust comparison of five Czech language corpora (SYN2020, csTenTen17, Koditex, ORALv1, and ORTOFONv2) containing over 12 billion running words in total. These five corpora represent a variety of corpus sizes, designs, and text types of both written and spoken Czech. This study investigates the overlap between the top 10,000 words in these corpora based on their normalized average reduced frequency (ARFn), a measure that takes both frequency and dispersion into account. It also investigates the overlap and rank correlation between words from the written corpora and words from the spoken corpora. Significant differences were found between the words used in written and spoken Czech, so the CGSL was built to contain three types of words: 1) core words of Czech, 2) core words of written Czech, and 3) core words of spoken Czech. Finally, this study compares the words on the CGSL to words on pedagogical wordlists from Czech textbooks designed for L1 English-speaking CSLLs and finds significant differences between the two. This suggests that future Czech as a second language (CSL) materials informed by the CGSL might have a different effect on Czech learning than currently existing CSL materials.
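To make the two measures named above more concrete, the following is a minimal illustrative sketch and not the procedure used in this study. It assumes the commonly cited definition of average reduced frequency, in which a corpus of N tokens is treated as circular, v = N / f is the expected gap for a word with f occurrences, and each gap between consecutive occurrences contributes min(gap, v) / v. The normalization that turns ARF into ARFn is not specified in this section and is omitted. The function names `average_reduced_frequency` and `top_n_overlap` are hypothetical.

```python
from typing import Sequence


def average_reduced_frequency(positions: Sequence[int], corpus_size: int) -> float:
    """ARF of one word, given the token positions where it occurs.

    Assumed definition: ARF = (1 / v) * sum(min(d_i, v)), with v = N / f and
    d_i the cyclic distances between consecutive occurrences.
    """
    f = len(positions)
    if f == 0:
        return 0.0
    pos = sorted(positions)
    v = corpus_size / f
    # The first distance wraps around from the last occurrence to the first one.
    distances = [pos[0] + corpus_size - pos[-1]]
    distances += [pos[i] - pos[i - 1] for i in range(1, f)]
    return sum(min(d, v) for d in distances) / v


def top_n_overlap(ranked_a: Sequence[str], ranked_b: Sequence[str], n: int) -> float:
    """Share of the top-n items of one ranked wordlist that also appear in the other's top n."""
    return len(set(ranked_a[:n]) & set(ranked_b[:n])) / n


# Four evenly spaced occurrences in a 100-token corpus keep their full frequency,
# while four adjacent occurrences are reduced to roughly one.
even = average_reduced_frequency([0, 25, 50, 75], 100)
clustered = average_reduced_frequency([0, 1, 2, 3], 100)
shared = top_n_overlap(["a", "b", "c"], ["b", "c", "d"], 3)
```

Under this definition, a word spread evenly through a corpus has an ARF equal to its raw frequency, whereas a word concentrated in a single passage has an ARF close to 1, which is why the measure reflects dispersion as well as frequency.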
CHAPTER 1. INTRODUCTION
Czech is an important world language which presents unique challenges for learners who speak English as a first language (L1), including the acquisition of new vocabulary. This study opens a scholarly discussion about how corpus-based methods and data can help Czech as a second language learners (CSLLs) to prioritize core vocabulary items from written and spoken Czech. The theories, questions, and methodologies used in this study are also applicable to other less-commonly taught languages which currently lack corpus-based pedagogical materials and approaches to learning.
According to Nation, a leading scholar on second language (L2) vocabulary research and a major contributor to the literature on this topic (Coxhead, 2010), the first goal of L2 vocabulary learning should be to know which vocabulary is useful for learners (Nation, 2013). Recent pedagogically motivated vocabulary research connects usefulness for learners with frequency information as opposed to native speaker intuition (Alderson, 2007; Gardner & Davies, 2014; Garnier & Schmitt, 2015; Lei & Liu, 2016). Receptive and productive knowledge of L2 words correlate with word frequency, and frequent words are encountered and used more often by learners (Ellis, 2014; Garnier & Schmitt, 2015).
Brezina and Gablasova (2015) agree that while "word frequency alone is not a reliable measure for selecting words important for learners" (p. 3), it is possible to quantify a word's importance for learners by measuring its frequency, dispersion within a corpus, and dispersion across multiple large language corpora representing a variety of situational ...