A vast data repository such as the web contains many broad domains of data which are quite distinct from each other, e.g. medicine, education, sports and politics. Each of these domains constitutes a subspace of the data within which the documents are similar to each other but quite distinct from the documents in another subspace. The data within these domains is frequently further divided into many subcategories. In this paper we present a novel hybrid parallel architecture that uses different types of classifiers trained on different subspaces to improve text classification within those subspaces. The classifier to be used on a particular input, and the relevant feature subset to be extracted, are determined dynamically using maximum significance values. We use the conditional significance vector representation, which enhances the distinction between classes within a subspace. We further compare the performance of our hybrid architecture with that of a single-classifier, full-data-space learning system and show that it outperforms the single-classifier system by a large margin when tested with a variety of hybrid combinations on two different corpora. Our results show that this new hybrid architecture boosts subspace classification accuracy and significantly reduces learning time.
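The routing idea described above can be sketched in a few lines: score each input against every subspace, pick the subspace with the maximum significance value, and hand the input to that subspace's classifier. The significance measure and the per-subspace classifiers below are illustrative stand-ins (a vocabulary-overlap proxy and trivial rule-based classifiers), not the paper's exact method; the subspace names and vocabularies are invented for the example.

```python
# Hedged sketch of significance-based subspace routing.
# Assumption: "significance" is approximated here as the fraction of a
# document's tokens found in a subspace's vocabulary.

SUBSPACE_VOCAB = {
    "medicine": {"patient", "drug", "clinical", "dose"},
    "sports":   {"match", "goal", "team", "season"},
}

def significance(tokens, vocab):
    """Fraction of tokens that fall inside a subspace vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocab) / len(tokens)

def route(tokens):
    """Assign the input to the subspace with maximum significance."""
    return max(SUBSPACE_VOCAB,
               key=lambda s: significance(tokens, SUBSPACE_VOCAB[s]))

# One classifier per subspace; in the paper's architecture these could be
# classifiers of entirely different types, each trained on its own subspace.
def medicine_clf(tokens):
    return "treatment" if "drug" in tokens else "diagnosis"

def sports_clf(tokens):
    return "football" if "goal" in tokens else "other-sport"

CLASSIFIERS = {"medicine": medicine_clf, "sports": sports_clf}

def hybrid_classify(tokens):
    """Route to a subspace, then classify within it."""
    subspace = route(tokens)
    return subspace, CLASSIFIERS[subspace](tokens)
```

For example, `hybrid_classify(["patient", "drug", "dose"])` routes to the `medicine` subspace and only that subspace's classifier runs, which is also why training can be parallelized and sped up: each classifier sees only its own subspace of the data.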
Abstract. The number of electronic documents available to us is increasing day by day, so it is important to investigate methods that speed up document search and reduce classifier training times. The available data is frequently divided into several broad domains with many sub-category levels, and each of these domains constitutes a subspace which can be processed separately. In this paper, separate classifiers of the same type are trained on different subspaces, and a test vector is assigned to a subspace using a fast, novel method of subspace detection. This parallel classifier architecture was tested with a wide variety of basic classifiers, and its performance was compared with that of a single basic classifier on the full data space. The improvement in subspace learning was accompanied by a very significant reduction in training times for all types of classifiers used.
Many organizations nowadays keep their data in multi-level categories for easier manageability. An example is the Reuters Corpus, whose news items are categorized in a hierarchy of up to five levels. The volume and diversity of documents available in such category hierarchies is also increasing daily, making it difficult for a traditional classifier to handle multi-level categorization of such a varied document space efficiently. In this paper, we present hybrid classifiers involving various two-classifier and four-classifier combinations for two-level text categorization. We show that the classification accuracy of the hybrid combination is better than that of each corresponding single classifier. The constituent classifiers of the hybrid combination operate on different subspaces obtained by semantic separation of the data. Our experiments show that dividing a document space into different semantic subspaces increases the efficiency of such hybrid classifier combinations. We further show that hierarchies with a larger number of categories at the first level benefit more from this general hybrid architecture.