Abstract-To retrieve required information effectively from the large volume of information collected from the Internet, document clustering has become a popular research topic in text mining. Clustering is the unsupervised classification of data items into groups, without the need for training data. Many conventional document clustering methods perform inefficiently on large document collections and require special handling for high dimensionality and high volume. We propose OCFI (Ontology and Closed Frequent Itemset-based Hierarchical Clustering), a hierarchical clustering method developed for document clustering. OCFI uses common words to cluster documents and builds a hierarchical topic tree. In addition, OCFI utilizes an ontology to address the semantic problem and mine the meaning behind the words in documents. Furthermore, we use closed frequent itemsets instead of frequent itemsets alone, which increases efficiency and scalability. The experimental results reveal that our method is more effective than well-known document clustering algorithms. The clustering results can be used in a personalized search service to help users obtain the information they need.

Index Terms-OCFI, document clustering, ontology, closed frequent itemsets.
I. INTRODUCTION

Due to the popularity of the Internet, the large number of documents, reports, e-mails, and web pages causes an information overload problem. Many enterprises spend considerable manpower organizing these unstructured data into a logical structure for later use. To save manpower and find interesting knowledge effectively, text mining is becoming increasingly important. To extract information from text, text mining usually involves parsing the input text, mining valuable information through analysis models, and finally evaluating and interpreting the output. However, most text mining algorithms are not capable enough to solve the problems of text mining, because many of them are modifications of traditional data mining algorithms that were originally designed for relational databases. Therefore, traditional clustering algorithms become impractical in real-world document clustering, which requires special handling for high dimensionality, high volume, and ease of browsing.

Fung proposed the Frequent Itemset-based Hierarchical Clustering (FIHC) method [11] to solve the problems of traditional algorithms. FIHC is a hierarchical clustering method developed for document clustering. Clustering, or cluster analysis, is the task of assigning a set of objects to groups, called clusters, so that the objects in the same cluster are more similar to each other than to those in other clusters. Clustering is a main topic in data mining and a common technique for statistical data analysis used in many fields, including machine learning, image analysis, information retrieval, and bioinformatics.

A major breakthrough of FIHC is that the clustering algorithm utilizes an important notion, frequent items...