Abstract: The paper introduces the ORTOfON corpus of spontaneous spoken Czech and the DIALEKT corpus of Czech dialects, their design principles and practical solutions adopted during data collection.
This paper is part of a larger research effort on language variability aimed at uncovering the relations between extra- and intratextual characteristics of Czech texts by means of multi-dimensional analysis. The palpable lack of prior art on quantitative register analysis of Czech led to several distinctive methodological decisions, namely concerning corpus design, feature selection and the parameters of factor analysis, especially the number of dimensions to extract. We report on these for their potential relevance to other researchers embarking on a similar journey. In order to demonstrate the viability of the model, we also present a brief interpretation of the resulting dimensions.
Research into causal conjunctions suggests that there are various degrees of causality and that causality is better situated on a cline between strong and weak. Some studies of English because/'cause/cos suggest a diachronic change in the spoken language, where the use of because is shifting from prototypical subordinator to discourse marker (Stenström, in: Jucker, Ziv (eds) Discourse markers, John Benjamins, Amsterdam, 1998; Burridge in Aust J Linguist 34(4): 2014). This study examines in detail the use of the most frequent Czech causal conjunction protože in both written and spoken language, thus making a further contribution to cross-linguistic research into causality and to research into the differences between spoken and written language more generally. There are two major language varieties of Czech: the common vernacular and the standard literary language (the codified norm). These two varieties differ in a number of respects, at the morphological, lexical and phonological levels. In comparing spoken and written Czech, very few studies include syntactic features and none are based on large-scale authentic spoken data. According to the corpus data, the conjunction protože occurs strikingly more frequently in spoken Czech than in written language. This study looks at some differences in its distribution. The study is based on extensive corpus data of both written Czech (comprising fiction, newspapers and academic texts) and spoken Czech (corpora of spontaneous conversations and TV debates).
The present paper seeks to review relevant criteria used in classifying speech events (SEs) from the perspective of spoken corpus design. The primary goal is to survey the landscape of possible types of spoken language, so as to assess in which directions the coverage of spoken Czech offered by Czech National Corpus corpora can be expanded in the future. We approach the problem from both theoretical and practical points of view, examining what the theoretical literature has to say as well as approaches implemented in practice by existing spoken corpora of various languages. We then synthesize the obtained information into a pragmatically motivated set of SE classification criteria which does not aspire to be universal or definitive but aims to serve as a useful guiding principle and conceptual framework for understanding and promoting SE diversity when collecting spoken data.