2010
DOI: 10.4236/jilsa.2010.23015

A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts

Abstract: This paper proposes a document clustering method for non-segmented texts that combines a self-organizing map (SOM) with the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and has been successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. The proposed method has two main phases: a preprocessing phase and a clustering phase. In the preprocessing phase, the fre…

Cited by 9 publications (5 citation statements)
References 20 publications
“…PCA (Principal Component Analysis) and LSI (Latent Semantic Indexing) are two widely used dimension reduction techniques used in literature [13,12,17,21] to reduce the corpus size without impacting the results significantly. Another way to look at tokenization of documents is using Frequent max substring technique [5,3,19] to improve the efficiency of Information Retrieval. This is especially effective in certain Asian languages such as Chinese, Japanese, Korean and Thai.…”
Section: Literature Review; citation type: mentioning
confidence: 99%
“…Similarity indices have been used in various domains for a long time: e.g. in clustering ecological species (Jaccard, 1908), in plant genetics (Meyeri et al, 2004) or in documents clustering (Chumwatana et al, 2010). Several similarity indices can be used to measure the agreement between two partitions of the same data, P_K and P_Q, with K and Q groups, respectively.…”
Section: Indices Of Paired Agreement Between Partitions; citation type: mentioning
confidence: 99%
“…As far as we know, this paper is the first one that gives a quantitative comparison of bag of maximal substrings representation with bag of words representation in document clustering. While Chumwatana et al conduct a similar experiment with respect to Thai documents (Chumwatana et al, 2010), they fail to give reliable evaluation, because their datasets consist of only tens of documents. Further, they do not compare bag of maximal substrings representation with bag of words representation.…”
Section: Introduction; citation type: mentioning
confidence: 99%