In most prior research, topic detection is defined as the task of discovering the different themes in a collection of documents. Our approach instead finds a topic for every document in the corpus, where a topic is any word or group of words that tells what the document is about. In this paper, we propose a novel topic detection approach based on an unsupervised model. It is a simple yet effective approach for detecting topics and extracting keywords from a corpus. The keywords are extracted automatically by identifying the relationships between words in a set of unstructured data, without any training data. Keyword extraction is based on a hypothesis for word decomposition, which states that the words in the bigram or trigram word vectors can be potential distributions of words from the unigram word vector. We use the standard term frequency (TF) measure to find the keywords. After keyword extraction, our proposed topic detection algorithm determines the most suitable topic for each document. The topics detected across the entire corpus and the keywords related to each topic are stored and analyzed. The effectiveness and accuracy of the keywords are judged by using them as features for classification and comparing the results against the standard bag-of-words approach. The topics detected by our algorithm are found to be relevant to their documents. The experimental results show that using keywords drastically reduces the dimensionality of the corpus while maintaining and, in most cases, improving the F-measure of categorization. Thus, our approach to feature selection for text categorization not only improves classification accuracy but also considerably reduces the time required for classification.
Categories and Subject Descriptors
H.4 [I.7]: Document and Text Processing