1985
DOI: 10.1002/asi.4630360502

Split size‐rank models for the distribution of index terms

Abstract: Since the introduction of the Zipf distribution, many functions have been suggested for the frequency of words in text. Some of these models have also been applied to the distribution of index terms in a set of documents. The models are of two forms: rank-frequency and frequency-size. The former serve well to describe the distribution of high-frequency terms; the latter the distribution of low-frequency terms. In this article, a split model is proposed, which uses both a rank function for the high frequency …
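
The two model forms named in the abstract describe the same term counts from two directions. The Python sketch below is only an illustration of that idea, not the authors' method: the function names, the sample counts, and the `threshold` that separates high-frequency from low-frequency terms are all assumptions, since the abstract is truncated before it says how the split is chosen or fitted.

```python
from collections import Counter


def rank_frequency(term_counts):
    """Return (rank, frequency) pairs, most frequent term first."""
    freqs = sorted(term_counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def frequency_size(term_counts):
    """Return (frequency, number of terms with that frequency) pairs."""
    sizes = Counter(term_counts.values())
    return sorted(sizes.items())


def split_view(term_counts, threshold):
    """Hypothetical split: rank view above `threshold`, size view at or below it."""
    high = {t: c for t, c in term_counts.items() if c > threshold}
    low = {t: c for t, c in term_counts.items() if c <= threshold}
    return rank_frequency(high), frequency_size(low)


# Made-up postings counts for eight index terms (illustrative only).
counts = {
    "heart": 40, "lung": 25, "liver": 12,
    "spleen": 2, "pancreas": 2,
    "thymus": 1, "pineal": 1, "uvula": 1,
}

head, tail = split_view(counts, threshold=5)
print(head)  # rank-frequency pairs for the high-frequency terms
print(tail)  # frequency-size pairs for the low-frequency terms
```

The rank view keeps one point per term, which suits the few high-frequency terms; the size view groups the many rare terms by how often they occur, which is why the two forms complement each other.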

citations
Cited by 29 publications
(16 citation statements)
references
References 27 publications
“…Index terms featuring in the "MEDLARS" database (Nelson & Tague, 1985 processes is not an appropriate model for these. In others, though, we found most acceptable fits.…”
Section: Discussion (mentioning)
confidence: 97%
“…(j) "MEDLARS" (Nelson & Tague, 1985) When first presented, the long-tailed nature of these data obliged Nelson and Tague to seek different sorts of distribution to model the low-to-medium frequencies and the high frequencies. Sichel (1992) achieved an acceptable fit for the GIGP with γ = 0 assumed a priori, based on sample mean and proportion of singletons, and using the originally reported groupings.…”
Section: Applications (mentioning)
confidence: 99%
“…Figure 1 shows the exhaustivity distributions used in the study representing low (M = 7 terms per document), observed (M = 11 terms per document), and high (M = 15 terms per document) levels of exhaustivity. The hypothetical low and high exhaustivity distributions were based on negative binomial models, which have been shown to be representative of exhaustivity in actual environments (Bird, 1974; Nelson & Tague, 1985). Figure 2 shows the distribution of descriptor occurrences across the document set; that is, the number of documents that contain a given descriptor.…”
Section: Methods (mentioning)
confidence: 99%
“…The modeling of databases is not new but a realistic simulation of the relevance relation has not previously been known. Griffiths (1978); Tague, Nelson, and Wu (1981); Tague, McClellan, & Nelson (1984); Nelson (1981, 1982); and Nelson and Tague (1985) have developed models of bibliographic databases. They have attempted to model the statistics of term occurrence within documents and globally throughout a whole collection.…”
Section: Introduction (mentioning)
confidence: 98%