Abstract: If we can operationalize corpus frequency in multiple ways, using absolute values and proportional values, which of them is more closely connected with the behaviour of language users? In this contribution, we examine overabundant cells in morphological paradigms, and look at the contribution that frequency of occurrence can make to understanding the choices speakers make due to this richness. We look at ways of operationalizing the term frequency in data from corpora and native speakers: the proportional frequency of forms (i.e. percentage of time that a variant is found in corpus data considered as a proportion of all variants) and several interpretations of absolute frequency (i.e. the raw frequency of variants in data from the same corpus). Working with data from unmotivated morphological variation in Czech case forms, we show that different instantiations of frequency help interpret the way variation is perceived and maintained by native speakers. Proportional frequency seems most salient for speakers in forming their judgements, while certain types of absolute frequency seem to have a dominant role in production tasks.

Key words: corpus linguistics, frequency, morphology, empirical research, surveys, questionnaires, Czech, overabundance
Introduction

Frequency data are familiar territory for any linguist who works with corpora. We cite the number of times a feature appears, or its normalized frequency if we are comparing corpora; we cite percentages to show structure within categories or to demonstrate change over time. Hidden behind the way we deal with these data is an implicit operationalization of our questions about language. We have chosen to let the corpus stand in for a particular language, type of language, genre, etc., but at the same time we have also chosen representations of frequency that give us the best chance of answering our research questions. It is worth interrogating these differing operationalizations of frequency to see how the same data, approached in different ways, can shed a different light on the way native speakers apprehend and use language.

The term frequency is elastic, and once we start looking at frequency data there are few limits to the number of ways we can treat them. Divjak (2016) considers, among other meanings, the traditional relative frequency (incidence per million), construction frequency (which itself covers various ways of relating the frequencies of related items), family frequency (incorporating various ways of looking at the size and composition of a class of words) and measures of probability and association. These will largely be beyond the scope of this study, which is focused on how we understand and manipulate the numbers that arise from simple counts of individual forms.

Our material comes from three sources. We have data from the Czech National Corpus (CNC) on the frequencies of forms occupying a single morphological "slot". We selected items