One of the fundamental questions about human language is whether its entropy rate is positive. The entropy rate measures the average amount of information communicated per unit time. The question of the entropy of language dates back to experiments by Shannon in 1951, but in 1990 Hilberg raised doubts about the correct interpretation of those experiments. This article provides an in-depth empirical analysis, using 20 corpora of up to 7.8 gigabytes across six languages (English, French, Russian, Korean, Chinese, and Japanese), to conclude that the entropy rate is positive. To obtain estimates for data length tending to infinity, we use an extrapolation function given by an ansatz. Whereas several ansatzes have been proposed previously, here we use a new stretched exponential extrapolation function that has a smaller error of fit. We conclude that the entropy rates of human languages are positive but approximately 20% smaller than estimates obtained without extrapolation. Although the entropy rate estimates depend on the kind of script, the exponent of the ansatz function turns out to be constant across languages and governs the complexity of natural language in general. In other words, despite typological differences, all languages seem equally hard to learn, which partly confirms Hilberg's hypothesis.
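The abstract does not state the exact form of the stretched exponential ansatz. The sketch below assumes one plausible form, f(n) = h · exp(A · n^(β−1)), which decays toward a positive limit h as the data length n grows (for β < 1), and fits it to toy per-character compression rates; the functional form, parameter names, and data values are illustrative assumptions, not the paper's.

```python
# A minimal sketch of extrapolating entropy-rate estimates to infinite data.
# Assumed ansatz (not taken from the paper): f(n) = h * exp(A * n**(beta - 1)),
# which tends to the positive limit h as n -> infinity when beta < 1.
import numpy as np
from scipy.optimize import curve_fit

def ansatz(n, h, A, beta):
    """Hypothetical stretched-exponential extrapolation function."""
    return h * np.exp(A * n ** (beta - 1.0))

# Toy data: per-character compression rates (bits/char) measured on
# increasing prefixes of a corpus (illustrative values only).
n = np.array([1e4, 1e5, 1e6, 1e7, 1e8])
rate = np.array([3.2, 2.6, 2.2, 1.9, 1.7])

params, _ = curve_fit(ansatz, n, rate, p0=(1.0, 5.0, 0.8), maxfev=10000)
h, A, beta = params
print(f"extrapolated entropy rate h = {h:.2f} bits/char, beta = {beta:.2f}")
```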
We propose an unsupervised segmentation method based on an assumption about language data: the point where the entropy of successive characters increases marks a word boundary. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data; a precision of 90% was attained, with recall around 80%. Moreover, we found that precision remained stable at around 90% independently of the training data size.
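A minimal sketch of this boundary criterion follows, assuming branching entropies are estimated from character n-gram counts over the unsegmented training text; the paper's exact model, context length, and smoothing are not specified here and are stand-ins.

```python
# A minimal sketch: place a word boundary wherever the entropy of the
# next-character distribution rises relative to the previous position.
import math
from collections import Counter, defaultdict

def branching_entropy(counts):
    """Shannon entropy (bits) of a successor-character count table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def train(text, order=4):
    """Count successor characters for each context of up to `order` chars."""
    succ = defaultdict(Counter)
    for i in range(len(text) - 1):
        ctx = text[max(0, i - order + 1): i + 1]
        succ[ctx][text[i + 1]] += 1
    return succ

def segment(text, succ, order=4):
    """Insert a boundary wherever entropy increases along the text."""
    out, prev_h = [text[0]], None
    for i in range(1, len(text)):
        ctx = text[max(0, i - order): i]
        h = branching_entropy(succ[ctx]) if succ[ctx] else 0.0
        if prev_h is not None and h > prev_h:
            out.append(" ")        # hypothesized word boundary
        out.append(text[i])
        prev_h = h
    return "".join(out)
```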
A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf's law). Here we address the complementary question of whether the rhythm of a text, characterized by the arrangement of its rare words, can be quantified mathematically in a similarly basic way. To this end, we consider representative classic single-authored texts from England/Ireland, France, Germany, China, and Japan. In each text, we classify each word by its rank. We focus on the rare words with ranks above some threshold Q and study the lengths of the (return) intervals between them. We find that for all texts considered, the probability S_Q(r) that the length of an interval exceeds r closely follows a Weibull function, S_Q(r) = exp(−b(β) r^β), with β around 0.7. The return intervals themselves are arranged in a long-range correlated, self-similar fashion, where the autocorrelation function C_Q(s) of the intervals follows a power law, C_Q(s) ∼ s^(−γ), with an exponent γ between 0.14 and 0.48. We show that these features lead to a pronounced clustering of the rare words in the text.
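The survival function S_Q(r) = exp(−b(β) r^β) stated above can be checked directly on a tokenized text. The sketch below extracts return intervals between rare words (rank above Q) and fits the Weibull form; the tokenization and the frequency-based ranking are simplifying assumptions, not the paper's exact procedure.

```python
# A minimal sketch of the return-interval analysis and Weibull fit.
import numpy as np
from collections import Counter
from scipy.optimize import curve_fit

def return_intervals(words, Q):
    """Intervals between consecutive occurrences of rare words (rank > Q)."""
    ranks = {w: r for r, (w, _) in enumerate(Counter(words).most_common(), 1)}
    positions = [i for i, w in enumerate(words) if ranks[w] > Q]
    return np.diff(positions)

def survival(intervals):
    """Empirical survival function P(interval length > r)."""
    r = np.sort(intervals)
    s = 1.0 - np.arange(1, len(r) + 1) / len(r)
    return r, s

def weibull(r, b, beta):
    return np.exp(-b * r ** beta)

# Usage, given a tokenized text `words` (list of strings):
# intervals = return_intervals(words, Q=1000)
# r, s = survival(intervals)
# (b, beta), _ = curve_fit(weibull, r, s, p0=(0.1, 0.7))
```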
This article presents a novel approach to readability assessment through sorting. A comparator that judges the relative readability of two texts is learned from data, and a given set of texts is sorted by this comparator. The approach alleviates the lack of training data, since constructing the comparator requires only texts annotated with two reading levels. The proposed method is compared with regression methods and a state-of-the-art classification method. Moreover, we present our application, called Terrace, which retrieves texts whose readability is similar to that of a given input text.
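A minimal sketch of sorting by a learned pairwise comparator follows, assuming a logistic-regression classifier over concatenated feature vectors; the paper's actual features and learner are not specified here, so both are placeholders.

```python
# A minimal sketch: learn a pairwise readability comparator, then sort with it.
from functools import cmp_to_key
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(x_a, x_b):
    """Hypothetical pair representation: concatenated text feature vectors."""
    return np.concatenate([x_a, x_b])

# Training: pairs of feature vectors labeled 1 if text a is easier than
# text b, else 0 (two annotated reading levels suffice to form such pairs).
# X_pairs = np.array([pair_features(a, b) for a, b in pairs])
# clf = LogisticRegression().fit(X_pairs, labels)

def comparator(clf):
    """Wrap the classifier as a comparison function for sorting."""
    def cmp(x_a, x_b):
        p = clf.predict(pair_features(x_a, x_b).reshape(1, -1))[0]
        return -1 if p == 1 else 1
    return cmp

# Sorting texts (as feature vectors) from easiest to hardest:
# ordered = sorted(texts, key=cmp_to_key(comparator(clf)))
```

Note that a learned comparator need not be perfectly transitive; Python's sorted() still produces an ordering, but its quality depends on the classifier's consistency.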