2014
DOI: 10.7232/jkiie.2014.40.1.018

KR-WordRank : An Unsupervised Korean Word Extraction Method Based on WordRank

Abstract: A word is the smallest unit for text analysis, and the premise behind most text-mining algorithms is that the words in given documents can be perfectly recognized. However, newly coined words, spelling and spacing errors, and domain adaptation problems make it difficult to recognize words correctly. To make matters worse, obtaining a sufficient amount of training data that can be used in any situation is not only unrealistic but also inefficient. Therefore, an automatic word extraction method which does …

Cited by 6 publications (10 citation statements)
References 14 publications
“…Methods of using affixes to determine Korean word boundaries can be found in existing studies [12], [21]. The study in [21] is similar in that it proposes a method which attempts to exclude meaningless morphemes from the analysis results, but it does not use the number of types of affixes which can be combined with roots.…”
Section: B Methods For Finding Word Boundaries
Citation type: mentioning
confidence: 99%
“…The study in [21] is similar in that it proposes a method which attempts to exclude meaningless morphemes from the analysis results, but it does not use the number of types of affixes which can be combined with roots. In [12], removing endings or postpositional particles according to rules was only discussed in terms of the method's uncertainty. Our proposed method is different in that we use the number of types of affixes which can be combined with roots, and we remove values which are below a threshold from the classification target.…”
Section: B Methods For Finding Word Boundaries
Citation type: mentioning
confidence: 99%
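
The affix-count filter described in the statement above can be illustrated with a minimal sketch. This is only one possible reading of the quoted text, not the citing authors' implementation: the function names, the `(root, affix)` input format, and the threshold value are all assumptions made for illustration, and how roots and affixes are identified in the first place is not covered.

```python
from collections import defaultdict


def affix_type_counts(pairs):
    """Count how many distinct affix types were observed with each root.

    `pairs` is a hypothetical input: an iterable of (root, affix) tuples.
    """
    seen = defaultdict(set)
    for root, affix in pairs:
        seen[root].add(affix)
    return {root: len(affixes) for root, affixes in seen.items()}


def filter_roots(pairs, threshold):
    """Keep only roots whose distinct-affix count reaches the threshold.

    Roots below the threshold are dropped, which is one way to read
    "remove values which are below a threshold".
    """
    counts = affix_type_counts(pairs)
    return {root: n for root, n in counts.items() if n >= threshold}


# Hypothetical observations; the repeated pair is counted once.
pairs = [("교량", "에"), ("교량", "이"), ("교량", "에"), ("균열", "이")]
kept = filter_roots(pairs, threshold=2)
```

The intuition in this reading is that a genuine root tends to combine with many different affixes, so a low distinct-affix count is treated as weak evidence that the string is a word.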
“…The proposed bridge damage factor recognition model aimed to classify each tokenized word as a bridge element, damage, cause, or other. Since tokenization is language-specific, this paper briefly introduces characteristics of the Korean language and tokenization method for Korean, which were suggested by Kim et al (2014).…”
Section: Preprocessing and Tokenizing
Citation type: mentioning
confidence: 99%
“…This study constructed a vocabulary list from the words used in bridge inspection reports and tokenized the text data in the reports based on the vocabulary list by applying a corpus-based tokenization process. As suggested by Kim et al (2014), all word candidates were first identified from the text to be tokenized, and likelihoods that these word candidates are complete words were measured, which will be referred to as word score. Among all possible consecutive strings of characters in each blank-separated unit, the string whose word score was the largest was extracted as a word, and the other string of characters was separated from the extracted word.…”
Section: Preprocessing and Tokenizing
Citation type: mentioning
confidence: 99%
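
The tokenization step described in the statement above (score every consecutive substring of a blank-separated unit, extract the highest-scoring one, and split off the rest) can be sketched as follows. This is a hedged illustration, not the method of Kim et al. or of the citing paper: the word scores are supplied as a hypothetical precomputed dictionary, the function names are invented here, and the tie-breaking rule (prefer the longer, then the earlier substring) is an assumption the quoted text does not specify.

```python
def best_substring(unit, scores):
    """Return (start, end) of the highest-scoring substring of `unit`.

    `scores` is a hypothetical mapping from candidate string to word score;
    unseen candidates score 0.0. Ties go to the longer, then the earlier
    substring, so a unit with no known candidate is returned whole.
    """
    best_key, best_span = None, (0, len(unit))
    for i in range(len(unit)):
        for j in range(i + 1, len(unit) + 1):
            key = (scores.get(unit[i:j], 0.0), j - i, -i)
            if best_key is None or key > best_key:
                best_key, best_span = key, (i, j)
    return best_span


def tokenize(text, scores):
    """Split each blank-separated unit into the best word and its remainders."""
    tokens = []
    for unit in text.split():
        i, j = best_substring(unit, scores)
        # Keep whatever precedes and follows the extracted word as separate pieces.
        tokens.extend(piece for piece in (unit[:i], unit[i:j], unit[j:]) if piece)
    return tokens


# Hypothetical scores for illustration only.
scores = {"교량": 0.9, "교량에": 0.3, "균열": 0.8}
tokens = tokenize("교량에 균열이", scores)
print(tokens)  # ['교량', '에', '균열', '이']
```

Enumerating all substrings is quadratic in the unit length, which is acceptable here because blank-separated units are short.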