We present word frequencies based on the subtitles of British television programmes. We show that the SUBTLEX-UK word frequencies explain more of the variance in the lexical decision times of the British Lexicon Project than do the word frequencies based on the British National Corpus and the SUBTLEX-US frequencies. In addition to the word form frequencies, we also present measures of contextual diversity, part-of-speech-specific word frequencies, word frequencies in children's programmes, and word bigram frequencies, giving researchers of British English access to the full range of norms recently made available for other languages. Finally, we introduce a new measure of word frequency, the Zipf scale, which we hope will stop the current misunderstandings of the word frequency effect.
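For illustration, here is a minimal Python sketch (not from the article) of the standard Zipf-scale computation, assuming the usual definition of the Zipf value as the base-10 logarithm of a word's frequency per billion tokens; the function name and example counts are hypothetical.

```python
import math

def zipf_scale(count: int, corpus_size: int) -> float:
    """Zipf value = log10 of the word's frequency per billion tokens.

    A word occurring once per million tokens gets a Zipf value of 3;
    100 per million gives 5. Zero counts are undefined here, so unseen
    words would need smoothing (e.g., a Laplace correction) first.
    """
    if count <= 0:
        raise ValueError("count must be positive; smooth zero counts first")
    return math.log10(count / corpus_size * 1_000_000_000)

# Example: a word seen 2,500 times in a 200-million-token corpus
# occurs 12.5 times per million words, giving a Zipf value of about 4.1.
print(round(zipf_scale(2_500, 200_000_000), 2))
```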
One of the most important predictors of word processing times is the frequency with which words have been encountered. In large-scale studies, word frequency (WF) reliably explains the largest percentage of variance of any predictor of word processing times (e.g., Baayen, Feldman, & Schreuder, 2006; Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Yap & Balota, 2009). Therefore, psycholinguists have invested time in the collection of WF measures. The first list of word frequencies widely used in language research was published in English by Thorndike and Lorge (1944; see Bontrager, 1991, for a review of older frequency lists, including German ones). Its main motivation was educational (helping teachers decide which words should be taught to pupils). A few decades later, Kučera and Francis (1967; KF) published a list (also for American English) that would become the frequency measure of choice for language researchers up to the present (Brysbaert & New, 2009).

For the Dutch language, van Berckel, Brandt Corstius, Mokken, and van Wijngaarden (1965) collected word frequencies based on a newspaper corpus of about 50,000 words. Although this list contained additional statistical information about the Dutch language, such as n-gram sequences of up to three letters, it did not gain wide adoption. The first publicly available frequency list for Dutch was edited by Uit den Boogaart (1975), who published frequencies of "written and spoken Dutch" based on a corpus of 605,733 words from written sources and 121,569 words from spoken sources. This book was superseded in 1993, when the Centre for Lexical Information (CELEX) published frequencies based on a 42-million-word corpus of written texts collected by the Institute for Dutch Lexicology (Baayen, Piepenbrock, & van Rijn, 1993). In addition to the frequencies of the different word forms (e.g., play, plays), the CELEX database also contained the frequencies of the words as different parts of speech (play as a noun vs. play as a verb) and the frequencies of the headwords or lemmas (e.g., the frequency of the nominal lemma play, consisting of the summed frequencies of the word form play as a noun and the word form plays as a noun). Since its publication, CELEX has been the primary source of word frequencies and other lexical information for the Dutch language.

For a long time, face validity was the main factor in assessing the quality of a frequency measure for research in word recognition. Two criteria were of importance: the representativeness of the sources and the size of the corpus. On both criteria, CELEX scored well. Special care had been taken to select texts from a wide variety of documents produced by the Dutch-speaking community, and the size of the corpus was larger than what was available in most other languages. However, in the past 2 years, researchers have started to measure the validity of word frequencies for research into word recognition processes by correlating them with word processing times for thousands of words. This research has revealed considerable qu...
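To make the distinction between word-form, part-of-speech-specific, and lemma frequencies concrete, here is a minimal Python sketch that aggregates form counts in the three ways described above. This is not the CELEX data format, and the counts and field layout are invented for illustration.

```python
from collections import defaultdict

# (word form, part of speech, lemma, frequency count) -- invented numbers
form_frequencies = [
    ("play",  "noun", "play", 1200),
    ("plays", "noun", "play",  300),
    ("play",  "verb", "play", 4100),
    ("plays", "verb", "play",  900),
]

word_form_freq = defaultdict(int)   # frequency of each surface form, any part of speech
pos_freq = defaultdict(int)         # frequency of each (form, part of speech) pair
lemma_freq = defaultdict(int)       # frequency of each (lemma, part of speech) headword

for form, pos, lemma, count in form_frequencies:
    word_form_freq[form] += count
    pos_freq[(form, pos)] += count
    lemma_freq[(lemma, pos)] += count

print(word_form_freq["play"])        # 5300: play as a noun plus play as a verb
print(pos_freq[("play", "noun")])    # 1200: play used as a noun
print(lemma_freq[("play", "noun")])  # 1500: play + plays used as nouns
```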
We present a new database of lexical decision times for English words and nonwords, for which two groups of British participants each responded to 14,365 monosyllabic and disyllabic words and the same number of nonwords, for a total duration of 16 h (divided over multiple sessions). This database, called the British Lexicon Project (BLP), fills an important gap between the Dutch Lexicon Project (DLP; Keuleers, Diependaele, & Brysbaert, Frontiers in Psychology, 1, 174, 2010) and the English Lexicon Project (ELP; Balota et al., 2007), because it applies the repeated-measures design of the DLP to the English language. The high correlation between the BLP and ELP data indicates that a high percentage of the variance in lexical decision data sets is systematic variance rather than noise, and that the results of megastudies are rather robust with respect to the selection and presentation of the stimuli. Because of its design, the BLP makes the same analyses possible as the DLP, offering researchers a new and interesting data set of word-processing times for mixed-effects analyses and mathematical modeling. The BLP data are available at http://crr.ugent.be/blp and as Electronic Supplementary Materials. The online version of this article (doi:10.3758/s13428-011-0118-4) contains supplementary material, which is available to authorized users.
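As a rough sketch of the kind of cross-study comparison mentioned above, the Python snippet below correlates item-level mean reaction times for the words shared by two megastudies and reports the shared variance. The file names and column names are assumptions for illustration, not the published data formats.

```python
import csv
import math

def load_mean_rts(path, word_col, rt_col):
    """Read a CSV of item-level mean lexical decision times into a dict."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[word_col].lower(): float(row[rt_col]) for row in csv.DictReader(f)}

# Hypothetical exports of BLP- and ELP-style item data.
blp = load_mean_rts("blp_items.csv", "spelling", "rt")
elp = load_mean_rts("elp_items.csv", "Word", "I_Mean_RT")

shared = sorted(set(blp) & set(elp))          # words present in both studies
x = [blp[w] for w in shared]
y = [elp[w] for w in shared]

# Pearson correlation between the two sets of item means.
mx, my = sum(x) / len(x), sum(y) / len(y)
r = (sum((a - mx) * (b - my) for a, b in zip(x, y))
     / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)))
print(f"{len(shared)} shared items, r = {r:.2f}, shared variance = {100 * r * r:.0f}%")
```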
Nonwords are essential in lexical decision tasks, in which participants are confronted with strings of letters or sounds and have to decide whether the stimulus forms an existing word. Together with word naming, semantic classification, perceptual identification, and eye-movement tracking during reading, the lexical decision task is one of the core instruments in the psycholinguist's toolbox for the study of word processing.

Although researchers are particularly concerned with the quality of their word stimuli (because their investigation depends on them), there is plenty of evidence that the nature of the nonwords also has a strong impact on lexical decision performance. As a rule, the more dissimilar the nonwords are to the words, the faster are the lexical decision times and the smaller is the impact of word features such as word frequency, age of acquisition, and spelling-sound consistency (e.g., Borowsky & Masson, 1996; Gerhand & Barry, 1999; Ghyselinck, Lewis, & Brysbaert, 2004; Gibbs & Van Orden, 1998). For instance, in Gibbs and Van Orden (Experiment 1), lexical decision times to the words were shortest (496 msec) when the nonwords were illegal letter strings (i.e., letter sequences, such as ldfa, that are not observed in the language), longer (558 msec) when the nonwords were legal letter strings (e.g., dilt), and still longer (698 msec) when the nonwords were pseudohomophones (i.e., nonwords sounding like real words, e.g., durt). At the same time, the difference in reaction times (RTs) between words with a consistent rhyme pronunciation (e.g., beech) and matched words with an inconsistent rhyme pronunciation (e.g., beard [inconsistent with heard]) increased. Because of the impact of the nonwords on lexical decision performance, there is general agreement among researchers that nonwords should be legal nonwords, unless there are theoretical reasons to use illegal nonwords. Legal nonwords that conform to the orthographic and phonological patterns of a language are also called pseudowords.

Although the requirement of pseudowords solves many problems for the creation of nonwords in the lexical decision task, there are additional considerations that must be taken into account. Because lexical decision is, in essence, a signal detection task (e.g., Ratcliff, Gomez, & McKoon, 2004), participants in a lexical decision task not only base their decision on whether the stimuli belong to the language, but also rely on other cues that help to differentiate between the word and nonword stimuli. In the same way that participants learn regularities in apparently random materials generated on the basis of an underlying grammar (i.e., the phenomenon of implicit learning; Reber, 1989), so are they susceptible to systematic differences between the word trials (requiring a "yes" response) and the nonword trials (requiring a "no" response). They exploit these biases to optimize their responses. Chumbley and Balota's (1984) study provides an example of this process. Because of an oversight, in their Experiment 2, the nonwords were on aver...
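As an illustration of what counts as a legal letter string in practice, here is a rough Python sketch, an assumption rather than the procedure used in any of the studies cited above, that screens candidate nonwords by checking whether all of their letter bigrams occur in a reference word list.

```python
def bigrams(word):
    """All adjacent letter pairs in a word, e.g. 'dilt' -> {'di', 'il', 'lt'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def attested_bigrams(lexicon):
    """Collect every letter bigram occurring in the reference word list."""
    attested = set()
    for word in lexicon:
        attested |= bigrams(word.lower())
    return attested

def looks_legal(candidate, attested):
    """True if every bigram of the candidate also occurs in the lexicon."""
    return bigrams(candidate.lower()) <= attested

# Tiny toy lexicon; a real screen would use a full word list and positional
# bigram or trigram statistics rather than this bare-bones check.
lexicon = ["dirt", "salt", "milk", "field", "beard", "heard"]
attested = attested_bigrams(lexicon)
print(looks_legal("dilt", attested))   # True: di, il, lt all occur in real words
print(looks_legal("ldfa", attested))   # False: 'df' and 'fa' are unattested here
```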