Evaluating Computational Language Models with Scaling Properties of Natural Language

Takahashi, Shuntaro; Tanaka-Ishii, Kumiko

doi:10.1162/coli_a_00355

Cited by 21 publications

(15 citation statements)

References 35 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Besides these good BA graph results, however, all random walks on a graph structure learned from Moby Dick (first row of the second block and first two rows of the third block of table 1) produced α ; 0.5. This suggests that linguistic sequences cannot be modeled by Markov models, which confirms both previous mathematical results (Lin and Tegmark 2016) and experimental results (Takahashi and Tanaka-Ishii 2018). The main reason is that the mean degree of the Markov models was large (above 10).…”

Section: Discussionsupporting

confidence: 89%

“…Our group (Takahashi and Tanaka-Ishii 2018) has also shown how all these models except neural models could not produce a Taylor exponent larger than 0.5. For example, figure 5 shows the Taylor analyses of texts generated by a first-order Markov model trained with the real Moby Dick and by the Simon process.…”

Section: Discussionmentioning

confidence: 95%

“…Both plots were tightly distributed around the regression line with an exponent of 0.5. (Takahashi and Tanaka-Ishii 2018) shows how the exponent cannot be improved by increasing the order of the model or by using advanced techniques.…”

Section: Discussionmentioning

confidence: 99%

See 2 more Smart Citations

Taylor’s law for linguistic sequences and random walk models

Tanaka-Ishii

Kobayashi

2018

J. Phys. Commun.

Self Cite

View full text Add to dashboard Cite

Taylor's law describes the fluctuation characteristics underlying a complex system in which the variance of an event within a time span grows by a power law with respect to the mean. The previous paper, Taylor's Law for Linguistic Sequences and Random Walk Models (Tanaka-Ishii and Kobayashi 2018), appeared in Journal of Physics Communications and described a new way to apply Taylor analysis to texts. The method was applied to over 1100 texts across 14 languages. The results showed how the Taylor exponents of natural-language written texts were consistently around 0.58, thus being universal.Experimentally, the Taylor exponent α is known to take a value within the range of 0.5α1.0 across a wide variety of domains, including finance, meteorology, agriculture, and biology. The previous paper shows how this is the case for language.The Taylor exponent is analytically proven to be 0.5 for an independent and identically distributed (i.i.d.) process. The paper also shows a case when 1.0 is reached. This Addendum provides two additional cases of rare word alignment for α=0.5 and α=1.0. These cases provide an understanding to interpret the value of the exponent of a real text.Consider dividing a text of length N into Q segments of length Δt, i.e., N=QΔt. Suppose that Q is sufficiently large.First of all, if a word only appears once in the entire text, then μ 1 and σ 1 are calculated as follows. AbstractTaylor's law describes the fluctuation characteristics underlying a complex system in which the variance of an event within a time span grows by a power law with respect to the mean. Although Taylor's law has been applied in many natural and social systems, its application for language has been scarce. This article describes a new, natural way to apply Taylor analysis to texts. The method was applied to over 1100 texts across 14 languages and showed how the Taylor exponents of natural language written texts were consistently around 0.58, thus being universal. The exponents were also evaluated for other languagerelated data, such as speech corpora (0.63 for adult speech, 0.68 for child-directed speech), programming language sources (0.79), and music (0.79). The results show how the Taylor exponent serves to quantify the fundamental structural complexity underlying linguistic time series. To explain the nature of natural language sequences possessing such different degrees of fluctuation, we investigated various mathematical models that could produce a Taylor exponent similar to that of real data. While the majority of previous probabilistic sequential models could not produce a Taylor exponent larger than 0.50, the same as in the independent and identically distributed (i.i.d.) case, random walk sequences on complex networks could produce fluctuation. We show that among various possibilities, random walks on a Barabási-Albert (BA) graph with small mean degree could fulfill the scaling properties of Zipf's law and the long-range correlation, in addition to having a Taylor's law exponent larger than 0.5, thus giving a...

show abstract

Section: Discussionsupporting

confidence: 89%

Section: Discussionmentioning

confidence: 95%

See 1 more Smart Citation

Taylor’s law for linguistic sequences and random walk models

Tanaka-Ishii

Kobayashi

2018

J. Phys. Commun.

Self Cite

View full text Add to dashboard Cite

show abstract

“…erefore, the task of word segmentation has changed from the use of algorithms to modeling [21]. However, this change results in significant disadvantages, namely [22], (1) A parameter space too large to be practical (2) A sparse data matrix e averaged perceptron (AP) [23] refers to a perceptron that records the cumulative value of the feature weights and uses averaging to obtain the final model [24]. Although the model segmentation performance of the model trained with the original training set is not very high, the improvement is significant after incremental training.…”

Section: Related Workmentioning

confidence: 99%

An Efficient Minimal Text Segmentation Method for URL Domain Names

Zhu

et al. 2021

Scientific Programming

View full text Add to dashboard Cite

Text segmentation of the URL domain name is a straightforward and convenient method to analyze users’ online behaviors and is crucial to determine their areas of interest. However, the performance of popular word segmentation tools is relatively low due to the unique structure of the website domain name (such as extremely short lengths, irregular names, and no contextual relationship). To address this issue, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining. We first designed a targeted hierarchical task model to reduce noise interference in minimal texts. We then presented a novel method of integrating conflict game into the two-directional maximum matching algorithm, which can make the words with higher weight and greater probability to be selected, thereby enhancing the accuracy of recognition. Next, Chinese Pinyin and English mapping were embedded in the word segmentation rules. Besides, we incorporated a correction factor that considers the text length into the F1-score to optimize the performance evaluation of text segmentation. The experimental results show that the EMTS yielded around 20 percentage points improvement with other word segmentation tools in terms of accuracy and topic extraction, providing high-quality data for the subsequent text analysis.

show abstract

“…We show that the sensitivity of words to population size is also reflected in their meaning. We also investigate how social media language and city size affects the parameters of Zipf’s Law [40], and how the exponent of Zipf’s Law is different from that of the literature value [40,41]. We also show that the number of new words needed in longer texts (Heaps' Law [2]) exhibits a sublinear power-law form on Twitter, indicating a decelerating growth of distinct tokens with city size.…”

Section: Introductionmentioning

confidence: 99%

Scaling in words on Twitter

2019

View full text Add to dashboard Cite

Scaling properties of language are a useful tool for understanding generative processes in texts. We investigate the scaling relations in citywise Twitter corpora coming from the metropolitan and micropolitan statistical areas of the United States. We observe a slightly superlinear urban scaling with the city population for the total volume of the tweets and words created in a city. We then find that a certain core vocabulary follows the scaling relationship of that of the bulk text, but most words are sensitive to city size, exhibiting a super- or a sublinear urban scaling. For both regimes, we can offer a plausible explanation based on the meaning of the words. We also show that the parameters for Zipf’s Law and Heaps' Law differ on Twitter from that of other texts, and that the exponent of Zipf’s Law changes with city size.

show abstract

Evaluating Computational Language Models with Scaling Properties of Natural Language

Cited by 21 publications

References 35 publications

Taylor’s law for linguistic sequences and random walk models

Taylor’s law for linguistic sequences and random walk models

An Efficient Minimal Text Segmentation Method for URL Domain Names

Scaling in words on Twitter

Contact Info

Product

Resources

About