2016
DOI: 10.48550/arxiv.1606.06996
Preprint

The word entropy of natural languages

Christian Bentz,
Dimitrios Alikaniotis

Abstract: The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty, also called average information content. Here we use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus, and to estimate word entropies across mor…
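
As a rough illustration of the quantity the abstract describes, the sketch below computes a plug-in (maximum-likelihood) unigram word entropy, H = -Σ p(w) log2 p(w), and re-estimates it on growing prefixes of a token list to see where the values level off. This is an assumption-laden sketch, not the authors' procedure: the estimator they use is not stated in the visible part of the abstract, and the function names `word_entropy` and `entropy_by_prefix`, the whitespace tokenization, and the toy sentence are all hypothetical.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Plug-in unigram entropy in bits per word (assumed estimator)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sum -p * log2(p) over the relative frequency p of each word type.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_by_prefix(tokens, step):
    """Entropy of the first n tokens for n = step, 2*step, ...

    Plotting or inspecting these pairs shows roughly where the
    estimate stops changing as more tokens are added.
    """
    return [
        (n, word_entropy(tokens[:n]))
        for n in range(step, len(tokens) + 1, step)
    ]


if __name__ == "__main__":
    # Toy input; a real study would use a large tokenized corpus.
    tokens = "the cat sat on the mat and the dog sat on the log".split()
    for n, h in entropy_by_prefix(tokens, step=4):
        print(n, round(h, 3))
```

The plug-in estimator is known to underestimate entropy on small samples, which is one reason to check how the estimate behaves as the number of tokens grows before comparing texts.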

Cited by 4 publications (2 citation statements)
References 21 publications
“…Entropy: We also consider measuring the amount of information with entropy H (Bentz and Alikaniotis, 2016; He et al., 2022) in the word w to reflect the importance of that word:…”
Section: Forward Process With Soft-masking (mentioning)
confidence: 99%
“…By extracting a distinct inventory of syllables, we establish symbolic representations for each syllable, enabling us to interpret the birdsong as a conventional textual sequence. If the sequence exhibits language-like properties, it is expected to demonstrate similar patterns of word entropy observed in other languages [10]. Jalak Suren (Sturnidae family) is a widely recognized avian species in Indonesia renowned for its melodic and repetitive vocalizations [11].…”
Section: Introduction (mentioning)
confidence: 98%