2016
DOI: 10.3390/e18100364
Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora

Abstract: One of the fundamental questions about human language is whether its entropy rate is positive. The entropy rate measures the average amount of information communicated per unit time. The question about the entropy of language dates back to experiments by Shannon in 1951, but in 1990 Hilberg raised doubt regarding a correct interpretation of these experiments. This article provides an in-depth empirical analysis, using 20 corpora of up to 7.8 gigabytes across six languages (English, French, Russian, Ko…

Cited by 45 publications (107 citation statements)
References 32 publications
“…Again, only texts with a discrepancy of less than 10% are included. In contrast, [11] establishes the convergence properties of different off-the-shelf compressors by estimating the encoding rate with growing text sizes. This has the advantage of giving a more fine-grained impression of convergence properties.…”
Section: Stabilization Criterion (mentioning, confidence: 99%)
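To make the quoted procedure concrete, here is a minimal sketch (not the code used in [11] or in the paper) of estimating a compressor's encoding rate on growing prefixes of a text. The choice of lzma as the off-the-shelf compressor, the doubling prefix schedule, and the file name corpus.txt are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): estimate the encoding rate r(n) of a
# text with an off-the-shelf compressor on growing prefixes, as a proxy for
# how the per-character code length converges. The compressor (lzma), the
# doubling prefix schedule, and the file name are illustrative assumptions.
import lzma

def encoding_rates(text: bytes, num_points: int = 12):
    """Return (n, bits per character) pairs for exponentially growing prefixes."""
    rates = []
    n = max(1, len(text) // (2 ** (num_points - 1)))
    while n <= len(text):
        compressed = lzma.compress(text[:n])
        rates.append((n, 8 * len(compressed) / n))  # bits per input byte
        n *= 2
    return rates

if __name__ == "__main__":
    with open("corpus.txt", "rb") as f:  # hypothetical input file
        data = f.read()
    for n, r in encoding_rates(data):
        print(f"n = {n:>12d}   r(n) = {r:.3f} bits/char")
```

Plotting r(n) against n is what gives the "fine-grained impression of convergence" the citing authors refer to.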
“…Shannon [1] defined the entropy, or average information content, as a measure for the choice associated with symbols in strings. Since Shannon's [2] original proposal, many researchers have undertaken great efforts to estimate the entropy of written English with the highest possible precision [3][4][5][6] and to broaden the account to other natural languages [7][8][9][10][11].…”
Section: Introduction (mentioning, confidence: 99%)
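As a reminder of the quantity at stake, the following is a minimal sketch, assuming a simple unigram character model rather than any of the estimators used in the cited works, of Shannon's entropy H = −Σ p(x) log2 p(x) over the symbols of a string.

```python
# Minimal sketch, not a cited paper's estimator: Shannon's entropy of a string
# under a unigram (i.i.d. character) model, H = -sum_x p(x) * log2 p(x),
# i.e. the average information content per symbol in bits.
from collections import Counter
from math import log2

def unigram_entropy(text: str) -> float:
    """Empirical character entropy in bits per character."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(unigram_entropy("abracadabra"))  # ≈ 2.04 bits/char
```

Because this ignores dependencies between symbols, it only upper-bounds the entropy rate; context-aware estimators and compressors, as used in the cited works, give much tighter bounds.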
“…This work can be situated as a study to quantify the complexity underlying texts. As summarized in (Tanaka-Ishii and Aihara, 2015), measures for this purpose include the entropy rate (Takahira, Tanaka-Ishii, and Lukasz, 2016; Bentz et al., 2017) and those related to the scaling behaviors of natural language. Regarding the latter, certain power laws are known to hold universally in linguistic data.…”
Section: Related Work (mentioning, confidence: 99%)
“…The more predictable the text is, the smaller r(n) becomes; therefore, r(n) is smaller for longer n, exhibiting decay. The fitting function here is the power ansatz f(n) = An^(β−1) + h proposed by Hilberg (1990), and the compressor was PPMd, using the 7zip application (refer to Takahira et al. (2016) for details). In addition to the original text, the WSJ was shuffled at the character, word, and document levels.…”
Section: Supporting Information (mentioning, confidence: 99%)
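A rough illustration of the fitting step quoted above, assuming hypothetical (n, r(n)) points rather than values from the WSJ experiment; scipy.optimize.curve_fit stands in for whatever fitting routine the cited authors used.

```python
# Sketch of the fitting step: fit Hilberg's ansatz f(n) = A * n**(beta - 1) + h
# to measured encoding rates r(n). The (n, r) values below are placeholders,
# not numbers from the paper; h plays the role of the extrapolated entropy rate.
import numpy as np
from scipy.optimize import curve_fit

def hilberg(n, A, beta, h):
    return A * n ** (beta - 1.0) + h

# (n, r(n)) pairs, e.g. produced by an encoding-rate sweep like the one above.
n = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
r = np.array([4.1, 3.2, 2.6, 2.2, 1.9])  # placeholder bits/char

(A, beta, h), _ = curve_fit(hilberg, n, r, p0=(10.0, 0.9, 1.0), maxfev=10000)
print(f"A = {A:.2f}, beta = {beta:.3f}, h = {h:.3f} bits/char")
```

The fitted offset h is the extrapolated per-character rate as n grows without bound, which is why a positive h is read as evidence for a positive entropy rate.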