2008
DOI: 10.1140/epjb/e2008-00206-x
Statistical keyword detection in literary corpora

Abstract: Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the spatial use of the words. Shannon's entropy of information is used as a tool for automatic keyword extraction. By using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detecto…
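The entropy-based detector outlined in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the partition count `n_parts`, the `min_count` cutoff, and the `1 - S` ranking score are assumed choices. The idea is that topical words cluster in a few sections of the text (low spatial entropy), while function words spread uniformly (entropy near 1).

```python
import math
from collections import Counter

def entropy_keywords(words, n_parts=8, min_count=10):
    """Rank words by how unevenly they spread across equal parts of the
    text: clustered (topical) words score high, uniform words score low."""
    part_len = max(1, len(words) // n_parts)
    # Count each word's occurrences per part of the text.
    per_part = {}
    for pos, w in enumerate(words):
        part = min(pos // part_len, n_parts - 1)
        per_part.setdefault(w, Counter())[part] += 1
    scores = {}
    for w, counts in per_part.items():
        n = sum(counts.values())
        if n < min_count:
            continue  # rare words give unreliable entropy estimates
        probs = [c / n for c in counts.values()]
        # Shannon entropy of the word's spatial distribution,
        # normalized to [0, 1] by the maximum log(n_parts).
        s = -sum(p * math.log(p) for p in probs) / math.log(n_parts)
        scores[w] = 1.0 - s  # higher = more clustered = more keyword-like
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, a word that appears only in one stretch of an otherwise uniform text rises to the top of the ranking, while evenly repeated function words score near zero.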


Cited by 65 publications (79 citation statements)
References 16 publications
“…Because the distribution of 'Carmylle' is much more irregular than the distribution of 'feel', the former takes a much higher value of intermittency, as defined in equation (5). The burstiness revealed by 'Carmylle' also suggests that this word represents a relevant concept in the book [31]. An important property of the intermittency measurement is that it does not correlate with the frequency (see figure 3).…”
Section: Intermittency
confidence: 99%
“…However, there are numerous situations where the comparison with a general database is not available or is not interesting: for instance, when authorship has to be attributed without previous knowledge of texts written by the potential authors. Here, we approach these problems by taking advantage of the finding that words are unevenly distributed not only across documents but also within them [16][17][18][21][22][23][24]. The quantification of the uneven distribution of words has been proposed based on measures commonly used by physicists [16,24].…”
Section: Intermittency Measurements
confidence: 99%
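The excerpts above refer to an intermittency measure defined in equation (5) of the citing work, which is not reproduced here. As an illustration of the general idea (that topical words are "bursty" while function words are placed roughly at random), one common intermittency-style statistic is the coefficient of variation of the gaps between successive occurrences of a word. The function below is a sketch under that assumption, not the cited paper's definition.

```python
import statistics

def burstiness(positions):
    """Coefficient of variation (sigma / mu) of the gaps between
    successive occurrences of a word in a text.  Values near 1 resemble
    random (Poisson-like) placement; values well above 1 indicate the
    clustered, 'bursty' use typical of topical words."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return float('nan')  # too few occurrences to measure
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return sigma / mu
```

A word occurring at evenly spaced positions yields a ratio of 0, while a word whose occurrences come in tight clusters separated by long silences yields a ratio well above 1; like the intermittency measure discussed above, this statistic is independent of the word's raw frequency.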
“…The usefulness of the latter approach stems from the finding that topical words are unevenly distributed along the text when compared with a random process or function words. This observation can be quantitatively investigated using different analogies and measures familiar to the communities of statistical physics and dynamical systems, including level statistics [22,24], burstiness [17][18][19], entropy [21] and intermittency measures [20]. The author dependence of the features mentioned above has been noted [14,17], but very little work has so far been devoted to quantifying the extent of this dependence and to testing its usefulness for the automatic detection of authors.…”
Section: Introduction
confidence: 99%
“…In 2008, J.P. Herrera et al tackled the problem of finding and ranking the relevant words of a document by using statistical information referring to the spatial use of the words [6]. Shannon's entropy of information was used for automatic keyword extraction.…”
Section: Some Linguistic Properties Of Keyphrases
confidence: 99%