The Relaxed Hilberg Conjecture: A Review and New Experimental Support

Dębowski, Łukasz

doi:10.1080/09296174.2015.1106268

Cited by 7 publications

(14 citation statements)

References 29 publications


“…In simple words, whereas entropy rate measures how hard it is to predict the text, exponent β measures how hard it is to learn to predict the text. Whereas the entropy rate strongly depends on the kind of the script, the exponent β turned out to be approximately constant, β ≈ 0.884, across six languages, as supposed in [9,11,12,33]. Thus we suppose that the exponent β is a language universal and it characterizes the general complexity of learning of natural language, all languages being equally hard to learn in spite of apparent differences.…”

Section: Discussion (mentioning)

confidence: 99%

“…As implicitly or explicitly supposed in [9,11,12,33], the β exponents could be some language universals, which is tantamount to saying that all human languages are equally hard to learn. Universality of exponent β ≈ 0.9 on much smaller data sets for the English, German, and French languages using ansatz f₁(n) has been previously reported in paper [33] in case of the Lempel-Ziv code rather than the PPM code. Our experimental data further corroborate universality of β, across a larger set of languages and a different universal code.…”

Section: Universality Of the Estimates Of Exponent β (mentioning)

confidence: 99%

“…As a by-product, using function f₃(n) we will obtain smaller estimates of the entropy rate than using function f₁(n). Hilberg [9] and a few other researchers [11,12,33] seemed to suppose that exponent β does not depend on a particular corpus of texts, i.e., it is some language universal which determines how hard it is to learn to predict the text. Exponent β is thus some important parameter of language, which is complementary to the entropy rate, which determines how hard it is to predict the text once the optimal prediction scheme has been learned.…”

Section: Extrapolation Functions (mentioning)

confidence: 99%


Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora

Takahira

Tanaka-Ishii

Dębowski

2016

Entropy

Self Cite

103


Abstract: One of the fundamental questions about human language is whether its entropy rate is positive. The entropy rate measures the average amount of information communicated per unit time. The question about the entropy of language dates back to experiments by Shannon in 1951, but in 1990 Hilberg raised doubt regarding a correct interpretation of these experiments. This article provides an in-depth empirical analysis, using 20 corpora of up to 7.8 gigabytes across six languages (English, French, Russian, Korean, Chinese, and Japanese), to conclude that the entropy rate is positive. To obtain the estimates for data length tending to infinity, we use an extrapolation function given by an ansatz. Whereas some ansatzes were proposed previously, here we use a new stretched exponential extrapolation function that has a smaller error of fit. Thus, we conclude that the entropy rates of human languages are positive but approximately 20% smaller than without extrapolation. Although the entropy rate estimates depend on the script kind, the exponent of the ansatz function turns out to be constant across different languages and governs the complexity of natural language in general. In other words, in spite of typological differences, all languages seem equally hard to learn, which partly confirms Hilberg's hypothesis.
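The extrapolation idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the functional forms, so the sketch assumes a Hilberg-style power-law ansatz r(n) = A·n^(β−1) + h, where r(n) is the compression rate at data length n and h is the extrapolated entropy rate. It is not the stretched exponential function the paper proposes. The data points are synthetic, the values A = 2.0 and h = 1.2 are hypothetical, and only β = 0.884 is taken from the citation statements above. The function name and the grid-search approach are illustrative choices.

```python
def fit_power_law_ansatz(ns, rates, betas=None):
    """Fit r(n) = A * n**(beta - 1) + h (assumed ansatz, see lead-in).

    For a fixed beta the model is linear in (A, h), so A and h come from
    ordinary least squares in closed form; beta is found by grid search.
    """
    if betas is None:
        betas = [i / 1000 for i in range(1, 1000)]  # candidate exponents in (0, 1)
    best = None
    m = len(ns)
    mean_y = sum(rates) / m
    for beta in betas:
        xs = [n ** (beta - 1.0) for n in ns]
        mean_x = sum(xs) / m
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0.0:
            continue  # degenerate regressor, skip this beta
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rates))
        a = sxy / sxx
        h = mean_y - a * mean_x
        sse = sum((a * x + h - y) ** 2 for x, y in zip(xs, rates))
        if best is None or sse < best["sse"]:
            best = {"A": a, "beta": beta, "h": h, "sse": sse}
    return best


# Synthetic, noise-free compression rates (bits per character, hypothetical).
TRUE_A, TRUE_BETA, TRUE_H = 2.0, 0.884, 1.2
ns = [10 ** k for k in range(3, 10)]
rates = [TRUE_A * n ** (TRUE_BETA - 1.0) + TRUE_H for n in ns]

fit = fit_power_law_ansatz(ns, rates)
print(f"beta = {fit['beta']:.3f}, extrapolated entropy rate h = {fit['h']:.3f}")
print(f"rate at largest n = {rates[-1]:.3f}")
```

On this toy data the extrapolated h lies below the rate observed at the largest n, which is the qualitative effect the abstract describes. Real compression curves are noisy, so a proper fit would also need error estimates.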


“…[25], a completely formal proof of the theorem about facts and words for strictly minimal grammar-based codes [23, 26] was provided. The respective related theory of natural language was later reviewed in [27, 28] and supplemented by a discussion of Santa Fe processes in [29]. A drawback of this theory at that time was that strictly minimal grammar-based codes used in the statement of the theorem about facts and words are not computable in a polynomial time [26].…”

Section: Introduction (mentioning)

confidence: 99%

Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited

Dębowski

2018

Entropy

Self Cite


Abstract: As we discuss, a stationary stochastic process is nonergodic when a random persistent topic can be detected in the infinite random text sampled from the process, whereas we call the process strongly nonergodic when an infinite sequence of independent random bits, called probabilistic facts, is needed to describe this topic completely. Replacing probabilistic facts with an algorithmically random sequence of bits, called algorithmic facts, we adapt this property back to ergodic processes. Subsequently, we call a process perigraphic if the number of algorithmic facts which can be inferred from a finite text sampled from the process grows like a power of the text length. We present a simple example of such a process. Moreover, we demonstrate an assertion which we call the theorem about facts and words. This proposition states that the number of probabilistic or algorithmic facts which can be inferred from a text drawn from a process must be roughly smaller than the number of distinct word-like strings detected in this text by means of the Prediction by Partial Matching (PPM) compression algorithm. We also observe that the number of the word-like strings for a sample of plays by Shakespeare follows an empirical stepwise power law, in a stark contrast to Markov processes. Hence, we suppose that natural language considered as a process is not only non-Markov but also perigraphic.


“…Our work establishes a general link between syntactic structure and the statistical properties of texts, joining other work which has established connections between grammatical rules and information-theoretic statistics (Dębowski, 2015). We believe the HDMI Hypothesis can form the basis for improved grammar induction algorithms, by providing a new perspective on the head-outward generative models that have formed the basis of most work in that area.…”

Section: Results (mentioning)

confidence: 67%

Syntactic dependencies correspond to word pairs with high mutual information

Futrell

Qian

Gibson et al.

2019

Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)


Abstract: How is syntactic dependency structure reflected in the statistical distribution of words in corpora? Here we give empirical evidence and theoretical arguments for what we call the Head-Dependent Mutual Information (HDMI) Hypothesis: that syntactic heads and their dependents correspond to word pairs with especially high mutual information, an information-theoretic measure of strength of association. In support of this idea, we estimate mutual information between word pairs in dependencies based on an automatically-parsed corpus of 320 million tokens of English web text, finding that the mutual information between words in dependencies is robustly higher than a controlled baseline consisting of non-dependent word pairs. Next, we give a formal argument which derives the HDMI Hypothesis from a probabilistic interpretation of the postulates of dependency grammar. Our study also provides some useful empirical results about mutual information in corpora: we find that maximum-likelihood estimates of mutual information between raw wordforms are biased even at our large sample size, and we find that there is a general decay of mutual information between part-of-speech tags with distance.
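The maximum-likelihood (plug-in) estimate of mutual information mentioned in the abstract can be sketched as follows. This is a generic textbook estimator, not the authors' pipeline. The function name and the toy word pairs are hypothetical, and the sketch does nothing about the estimation bias the abstract reports.

```python
import math
from collections import Counter


def plugin_mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) word pairs.

    Probabilities are replaced by relative frequencies, so
    p(x, y) / (p(x) * p(y)) becomes count(x, y) * n / (count(x) * count(y)).
    """
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        mi += (c / n) * math.log2(c * n / (left[x] * right[y]))
    return mi


# Toy data (hypothetical): the head fully determines the dependent ...
dependent_pairs = [("eat", "food"), ("drink", "water")] * 50
# ... versus a baseline in which head and dependent are independent.
baseline_pairs = [(h, d) for h in ("eat", "drink") for d in ("food", "water")] * 25

mi_dependent = plugin_mutual_information(dependent_pairs)
mi_baseline = plugin_mutual_information(baseline_pairs)
print(mi_dependent, mi_baseline)
```

In the toy case the fully dependent pairs carry more mutual information than the independent baseline, which mirrors the kind of comparison the abstract describes between dependency pairs and non-dependent word pairs.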
