Detection of duplicate bug reports has received much attention in recent work, partly because duplicate reports consume the time of bug triagers and software developers. Previous studies have developed many schemes based on text mining, information retrieval, and natural language processing techniques. In this paper, we propose a method that improves centroid characteristics by adjusting the centroids with better initial values than those used in the Class-Feature-Centroid (CFC) approach [12]. Building on the effectiveness of CFC, the centroid-based approach can further improve detection performance. The method has two steps. First, we extract inter-class and inner-class term indices from the corpus. Second, we enhance the centroid calculation based on class features. For the similarity measure, we replace the traditional cosine similarity with the denormalized cosine measure, which is also used in [12].
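The two steps and the similarity measure can be illustrated with a minimal sketch of a plain CFC-style classifier. It does not reproduce the improved centroid initialization proposed in the abstract, which is not specified here. The weighting formula (`b ** (DF / |C_j|) * log(|C| / CF)`), the default `b = e`, the function names, and the toy tokens are all assumptions for illustration; "denormalized" is taken to mean that only the document vector is length-normalized while the centroid is left as-is.

```python
import math
from collections import Counter, defaultdict


def cfc_centroids(docs, labels, b=math.e):
    """Build one sparse centroid per class from tokenized documents.

    Assumed weighting: b ** (DF_in_class / class_size) * log(n_classes / CF),
    where CF is the number of classes that contain the term.
    """
    class_sizes = Counter(labels)

    # Inner-class index: per class, how many documents contain each term.
    inner = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        inner[label].update(set(tokens))

    # Inter-class index: how many classes contain each term.
    inter = Counter()
    for term_df in inner.values():
        inter.update(term_df.keys())

    n_classes = len(class_sizes)
    return {
        label: {
            term: (b ** (df / class_sizes[label])) * math.log(n_classes / inter[term])
            for term, df in term_df.items()
        }
        for label, term_df in inner.items()
    }


def denormalized_cosine(doc_vec, centroid):
    """Dot product divided by the document norm only (centroid not normalized)."""
    norm = math.sqrt(sum(v * v for v in doc_vec.values()))
    if norm == 0:
        return 0.0
    return sum(w * centroid.get(t, 0.0) for t, w in doc_vec.items()) / norm


def classify(tokens, centroids):
    """Return the label of the centroid most similar to the document."""
    doc_vec = Counter(tokens)
    return max(centroids, key=lambda c: denormalized_cosine(doc_vec, centroids[c]))
```

Under this weighting, a term that occurs in every class receives weight zero, because `log(|C| / CF)` vanishes, so only class-discriminating terms contribute to the similarity.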
Text mining is a subfield of data mining that focuses on discovering new information from text document collections, mainly by using techniques from data mining, machine learning, natural language processing, and information retrieval. Text classification is the process of analyzing text content and deciding whether the text belongs to one group, to several groups, or to none of the predefined groups. There has been much effective research on this problem worldwide, especially for English texts, but little for Vietnamese texts. Moreover, the existing results and applications remain limited, partly because of the characteristics of the Vietnamese language at the word and sentence level and because many words have several meanings depending on context. Text classification involves many features, so improving its effectiveness is the aim of many researchers. In this research, the author implements two feature selection methods, singular value decomposition and optimal orthogonal centroid feature selection, whose computational efficiency has been shown on English text documents and is now evaluated on Vietnamese text documents. Among the many available classification techniques, we use the support vector machine learning algorithm, which has proven effective for text classification problems. With singular value decomposition and optimal orthogonal centroid feature selection, the results obtained are better than those of the traditional method.
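As a rough illustration of the centroid-based feature selection mentioned above, the sketch below scores each feature by how far the class centroids spread around the overall centroid and keeps the top `k`. The scoring formula (sum over classes of `(n_j / n) * (m_j[i] - m[i]) ** 2`) is the commonly cited form of optimal orthogonal centroid feature selection and is an assumption here, since the abstract does not state it. The function names and the dense list-of-lists input are hypothetical, and the SVD and SVM stages are not shown.

```python
def ocfs_scores(X, y):
    """Score each feature by the weighted spread of class centroids.

    X: list of equal-length numeric feature rows; y: class label per row.
    """
    n, d = len(X), len(X[0])
    overall = [sum(row[i] for row in X) / n for i in range(d)]
    scores = [0.0] * d
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        n_j = len(rows)
        for i in range(d):
            class_mean = sum(r[i] for r in rows) / n_j
            scores[i] += (n_j / n) * (class_mean - overall[i]) ** 2
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:k])
```

A feature whose mean is the same in every class scores zero and is dropped first, which is the intuition behind using centroid spread as a selection criterion.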