The paragraph spaces of five text corpora, of different genres and intended audiences, in four different languages, all show the same two-scale structure, with the dimension at short distances being lower than at long distances. In all five cases the short-distance dimension is approximately eight. Control simulations with randomly permuted word instances do not exhibit a low-dimensional structure. The observed topology places important constraints on the way in which authors construct prose, which may be universal.

correlation dimension | language | latent semantic analysis

As we transition from paragraph to paragraph in written discourse, one can think of the path through which one passes as a trajectory through a semantic space. Understanding the discourse is, in some sense, a matter of understanding this trajectory. Although it is difficult to predict exactly what will follow from a given discourse context, we can ask a broader question. Does the cognitive system impose any fundamental constraints on the way in which the discourse evolves?

To investigate this issue, we constructed semantic spaces for five corpora, of different genres and intended audiences, in four different languages and then calculated the intrinsic dimensionality of the paragraph trajectories through these corpora. Each trajectory can be embedded in an arbitrary number of dimensions, but it has an intrinsic dimensionality, independent of the embedding space. For instance, in latent semantic analysis (LSA) applications, it is typical to use a 300-dimensional space (1). However, the points that represent the paragraphs in this 300-dimensional space do not fill the embedding space; rather they lie on a subspace with a dimension lower than the embedding dimension.
The fact that a low-dimensional structure can be embedded in a higher dimensional space is routinely used in the study of nonlinear dynamic systems, in which the embedding theorem (2) relates the dimensionality of the dataset under study to the dimensionality of the dynamics that describes it.

Historically, the dimensionality of the discourse trajectory has been implicitly assumed to be very large, but it has never been calculated. Here we show that the dimensionality of the trajectory is surprisingly low (approximately eight at short distances) and that its structure is probably universal across human languages. Although the question of dimensionality has not generally been considered before, it can be used to guide the development of new models of prose, which are constrained to reproduce the observed dimensional structure.
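
The distinction between embedding dimension and intrinsic dimension can be illustrated numerically. The following is a minimal sketch, not the procedure used in this study: it estimates a correlation dimension as the slope of log C(r) against log r, where C(r) is the fraction of point pairs separated by at most r, for synthetic points that fill a 3-dimensional cube rotated into a 50-dimensional space. The function names, the sample size, and the quantile range used to select short distances are all assumptions made for illustration.

```python
import numpy as np


def sorted_pair_distances(points):
    """Return the sorted Euclidean distances between all distinct pairs of rows."""
    sq_norms = np.sum(points**2, axis=1)
    # Squared distances from the Gram matrix; clip tiny negatives from round-off.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)
    upper = np.triu_indices(len(points), k=1)
    return np.sort(np.sqrt(np.maximum(d2[upper], 0.0)))


def correlation_dimension(points, lo_q=0.001, hi_q=0.02, n_radii=12):
    """Slope of log C(r) versus log r over a short-distance range of radii.

    C(r) is the fraction of point pairs separated by at most r. The quantile
    bounds lo_q and hi_q are illustrative choices, not values from the paper.
    """
    dists = sorted_pair_distances(points)
    r_lo, r_hi = np.quantile(dists, [lo_q, hi_q])
    radii = np.geomspace(r_lo, r_hi, n_radii)
    # Count of pair distances <= r, normalized by the number of pairs.
    c_r = np.searchsorted(dists, radii, side="right") / dists.size
    slope, _ = np.polyfit(np.log(radii), np.log(c_r), 1)
    return slope


def embedded_cube_sample(n_points, intrinsic_dim, embed_dim, seed=0):
    """Uniform points in a low-dimensional cube, rotated into a larger space."""
    rng = np.random.default_rng(seed)
    latent = rng.uniform(size=(n_points, intrinsic_dim))
    # Orthonormal columns, so pairwise distances are preserved by the embedding.
    basis, _ = np.linalg.qr(rng.standard_normal((embed_dim, intrinsic_dim)))
    return latent @ basis.T


if __name__ == "__main__":
    cloud = embedded_cube_sample(1500, intrinsic_dim=3, embed_dim=50)
    print(f"estimated dimension: {correlation_dimension(cloud):.2f}")
```

Under these assumptions the estimate should fall near the intrinsic dimension of 3 and far below the embedding dimension of 50, typically slightly under 3 because points near the boundary of the cube have fewer neighbors.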
Modeling Semantics

The first step toward being able to calculate the dimensionality of text is to create a vector representation of the semantics conveyed by each paragraph. Recent years have seen increasing interest in automated methods for the construction of semantic representations of paragraphs [e.g., LSA (3), the topics model (4, 5), nonnegative matrix factorization (6), and the constructed semantics model (7)]. These methods were originally developed for use...