2005
DOI: 10.1007/11562214_24

Automatic Term Extraction Based on Perplexity of Compound Words

Abstract: Many methods of term extraction have been discussed in terms of their accuracy on huge corpora. However, when we try to apply various methods that derive from frequency to a small corpus, we may not be able to achieve sufficient accuracy because of the shortage of statistical information on frequency. This paper reports a new way of extracting terms that is tuned for a very small corpus. It focuses on the structure of compound terms and calculates perplexity on the term unit's left-side and right-sid…
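To make the idea of perplexity on a term unit's left and right sides concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names `side_perplexities` and `score`, the toy compounds, the use of 2 raised to the Shannon entropy as the perplexity, and the geometric-mean scoring are all assumptions chosen for illustration, since the abstract above is truncated and does not give the exact formula.

```python
import math
from collections import Counter, defaultdict


def side_perplexities(compounds):
    """Return (left, right) dicts mapping each term unit to the perplexity
    of the units observed immediately to its left / right in compounds.

    `compounds` is a list of compound terms, each a list of term units.
    Perplexity is taken as 2 ** entropy of the neighbor distribution
    (an assumption for this sketch).
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for units in compounds:
        for prev, nxt in zip(units, units[1:]):
            right[prev][nxt] += 1  # nxt appears to the right of prev
            left[nxt][prev] += 1   # prev appears to the left of nxt

    def perplexity(counter):
        total = sum(counter.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counter.values())
        return 2 ** entropy

    return ({u: perplexity(c) for u, c in left.items()},
            {u: perplexity(c) for u, c in right.items()})


def score(units, left_pp, right_pp):
    """Hypothetical compound score: geometric mean of the left and right
    perplexities of all units. Units never seen with a neighbor on a
    given side default to a perplexity of 1.0."""
    log_sum = sum(math.log(left_pp.get(u, 1.0)) + math.log(right_pp.get(u, 1.0))
                  for u in units)
    return math.exp(log_sum / (2 * len(units)))


# Toy data (hypothetical): each compound is already split into term units.
compounds = [
    ["term", "extraction"],
    ["term", "weighting"],
    ["keyword", "extraction"],
]
left_pp, right_pp = side_perplexities(compounds)
```

In this toy corpus, "term" is followed by two different units, so its right-side perplexity is higher than that of "keyword", which is followed by only one. Under the assumed scoring, `score(["term", "extraction"], left_pp, right_pp)` therefore comes out higher than `score(["keyword", "extraction"], left_pp, right_pp)`. Because the measure depends on how many distinct neighbors a unit has rather than on raw frequency alone, a sketch like this can still separate candidates when counts are small.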

Cited by 15 publications
(8 citation statements)
References 4 publications
“…For example, removing stop words using a stop word list [8], using part of speech tags [9], and using N-grams [10]. Yoshida and Nakagawa [11] proposed a keyword extraction method for Japanese language documents. This method uses perplexity on the term unit's left-side and right-side terms for extracting technical terms.…”
Section: A Candidate Keyword Extraction (mentioning)
confidence: 99%
“…FLR is a word scoring method that uses internal structures and frequencies of candidates (FLR: Frequencies and Left and Right of the current word). One of the advantages of the FLR method is its size-robustness, that it can be applied to small corpus with less significant drop in performance than other standard methods like TF and IDF, because it is defined using more fine-grained features [30].…”
Section: Multi-word Aspects (mentioning)
confidence: 99%
“…The particularity of C-value is the consideration of the nested terms. Minoru [5] presented a method combining perplexity and frequency information to rank the candidates. Li [6] removed the non-term items from the candidates based on the CBC clustering algorithm.…”
Section: Introduction (mentioning)
confidence: 99%