The corpora for this study are drawn from News Co-broadcasting, Daily Conversations and Behind the headlines with Wentao, which represent the formal written style, the colloquial style and the conversational style, respectively. Sentence length, word length, part of speech (POS) and sentence-initial word POS are extracted from the pre-processed corpora as features to generate text vectors, which are then clustered with the PAM (partitioning around medoids) and Ward algorithms. The clustering results show that: (1) it is reasonable to select sentence length, word length, POS and sentence-initial word POS as quantitative stylistic features for Chinese; (2) style is a polarized continuum: the formal written style and the colloquial style lie at opposite poles, while the conversational style lies in between, nearer the colloquial pole.
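The clustering step described above can be sketched as follows. This is a minimal, dependency-free illustration of PAM (a greedy BUILD phase followed by a SWAP phase that accepts any cost-reducing swap), not the authors' implementation. The feature vectors are invented toy values standing in for (mean sentence length, mean word length, noun proportion); the abstract does not give the actual feature encoding, and Ward clustering is not shown.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(euclidean(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Partitioning around medoids: returns (medoid indices, cluster labels)."""
    n = len(points)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(points, medoids + [i]))
        medoids.append(best)
    # SWAP: replace a medoid with a non-medoid whenever that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(points, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    # Label each point with the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: euclidean(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Hypothetical text vectors (illustrative values only, not data from the study).
texts = [
    [30.0, 2.1, 0.40], [28.0, 2.0, 0.38],  # stand-ins for formal written texts
    [14.0, 1.6, 0.25], [13.0, 1.5, 0.24],  # stand-ins for conversational texts
    [8.0, 1.3, 0.18], [7.0, 1.2, 0.17],    # stand-ins for colloquial texts
]
medoids, labels = pam(texts, k=3)
print(medoids, labels)
```

The toy features are left unscaled, so sentence length dominates the distance; with real data the features would normally be standardized before clustering.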
The Menzerath-Altmann law (henceforth the MA law), uncovered by Menzerath and formulated by Altmann, describes the correlation between a language structure and its immediate components at all linguistic levels. This paper examines the correlation between Chinese compound sentences and their components, clauses, in texts of different styles based on the MA law. The results show that the MA law describes this correlation only in formal written texts.
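For reference, the MA law is commonly written in the following form. The abstract does not say which variant was fitted or how lengths were measured, so the reading of the symbols here is an assumption: $x$ is the size of the construct (for example, the number of clauses in a compound sentence), $y$ is the mean size of its components (for example, mean clause length), and $a$, $b$, $c$ are fitted parameters.

```latex
y = a\,x^{b}\,e^{-c x}
```

A frequently used truncated form sets $c = 0$, giving $y = a\,x^{b}$ with $b < 0$, which expresses the idea that longer constructs tend to have shorter components.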
This paper reports an innovative Chinese register study based on regression analysis of sentence length distributions and on text clustering. Although the end of a sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances form sentences (i.e., clauses). Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula $F = aL^{b}c^{L}$, where $L$ is sentence/clause length. Texts from different registers give rise to different fitted values of the parameters, which can therefore serve to differentiate these registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful text clustering results further confirm that the fitted parameters are reliable linguistic characteristics of different registers. In terms of linguistic theory, our study shows that it is just as effective to model sentence length in Chinese using sociological words (i.e., characters) as it is using linguistic words.
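The fitting step can be illustrated with a small sketch. It reads the formula as $F = aL^{b}c^{L}$ (an assumption about the exponents) and takes logarithms, so that $\ln F = \ln a + b \ln L + L \ln c$ becomes an ordinary linear least-squares problem. The abstract does not say whether the authors used this log-linear route or a non-linear fit, so the procedure, the function names and the synthetic data below are all illustrative assumptions.

```python
import math

def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def fit_length_distribution(lengths, freqs):
    """Fit F = a * L**b * c**L via least squares on the log-transformed model."""
    # Design matrix columns: intercept (ln a), ln L (slope b), L (slope ln c).
    rows = [[1.0, math.log(L), float(L)] for L in lengths]
    targets = [math.log(F) for F in freqs]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    ln_a, b, ln_c = solve_linear_system(xtx, xty)
    return math.exp(ln_a), b, math.exp(ln_c)

# Synthetic, noise-free frequencies generated from known parameters
# (illustrative only, not data from the study).
lengths = list(range(1, 31))
freqs = [50.0 * L ** 1.5 * 0.8 ** L for L in lengths]
a, b, c = fit_length_distribution(lengths, freqs)
print(a, b, c)
```

The triple `(a, b, c)` obtained for each text is the kind of parameter vector that could then be passed to a clustering algorithm. Fitting in log space weights small frequencies more heavily than a direct non-linear fit would, and lengths with zero observed frequency must be dropped before taking logarithms.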