Dimensionality refers to the number of terms in a web page. High dimensionality causes problems when web pages are classified. The main objective of reducing the dimensionality of web pages is to improve the performance of the classifier. Processing time and accuracy are the two parameters that determine a classifier's performance. To reduce processing time, less informative and redundant terms have to be removed from web pages. This research describes a hybrid approach to dimensionality reduction in web page classification using a rough set and the naïve Bayesian method. Feature selection and dimensionality reduction methods are used to reduce the dimensionality. The information gain method is used for feature selection. The rough set based Quick Reduct algorithm is used for dimensionality reduction. The naïve Bayesian method is used to classify web pages into predefined categories. A web page is assigned to the category with the maximum posterior probability. The words that remain after feature selection and dimensionality reduction are given to the classifier. Finally, the classifier assigns the most suitable predefined category to each web page.
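The maximum posterior rule described above can be illustrated with a minimal sketch. It assumes a multinomial naïve Bayes model with add-one (Laplace) smoothing, which the abstract does not specify; the function names `train_naive_bayes` and `classify` and the sample categories are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per category and term occurrences per category.

    docs is a list of term lists (the terms kept after reduction);
    labels holds the predefined category of each document.
    """
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, label in zip(docs, labels):
        term_counts[label].update(terms)
        vocabulary.update(terms)
    return class_counts, term_counts, vocabulary


def classify(terms, model):
    """Return the category with the highest log posterior score."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # log prior P(category)
        score = math.log(n_docs / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in terms:
            if term not in vocabulary:
                continue  # terms never seen in training are ignored
            # log likelihood P(term | category); add-one smoothing is an assumption
            score += math.log(
                (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: each page is already reduced to a few terms.
pages = [["goal", "match", "team"], ["match", "score"],
         ["stock", "market"], ["market", "price", "share"]]
categories = ["sports", "sports", "finance", "finance"]
model = train_naive_bayes(pages, categories)
print(classify(["match", "goal"], model))
```

Working with log probabilities avoids numeric underflow when a page contains many terms, which is one reason fewer terms also help the classifier run faster.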
Today a huge amount of data is available on the World Wide Web. One way to manage this data is web page classification. The issue of web page classification considered in this paper is high dimensionality. Dimensionality refers to the number of terms in a web page. High dimensionality causes problems when web pages are classified. The main objective of reducing the dimensionality of web pages is to improve the performance of the classifier. This paper describes a hybrid approach to dimensionality reduction for web page classification using a rough set and the information gain method. Feature selection and dimensionality reduction methods are used to reduce the dimensionality of web pages. The information gain method is used for feature selection. The rough set based Quick Reduct algorithm is used for dimensionality reduction. Web pages are classified using the naïve Bayesian method. The proposed architecture is tested, and significant results are obtained.
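The two reduction steps named above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each page is treated as a set of terms (binary presence or absence), information gain is computed from category entropy, and Quick Reduct is written in its usual greedy form based on the rough set dependency degree. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by knowing whether `term` occurs in a page."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def dependency(docs, labels, terms):
    """Rough set dependency degree: the fraction of pages whose category is
    fully determined by the presence or absence of the given terms."""
    groups = defaultdict(list)
    for d, y in zip(docs, labels):
        groups[tuple(t in d for t in terms)].append(y)
    consistent = sum(len(ys) for ys in groups.values() if len(set(ys)) == 1)
    return consistent / len(labels)


def quick_reduct(docs, labels, candidate_terms):
    """Greedily add the term that raises the dependency degree the most,
    until the subset is as discriminating as the full candidate set."""
    target = dependency(docs, labels, candidate_terms)
    reduct = []
    while dependency(docs, labels, reduct) < target:
        remaining = [t for t in candidate_terms if t not in reduct]
        best = max(remaining, key=lambda t: dependency(docs, labels, reduct + [t]))
        reduct.append(best)
    return reduct


# Hypothetical toy data: pages as sets of terms with predefined categories.
pages = [{"goal", "match", "team"}, {"match", "score", "team"},
         {"stock", "market", "team"}, {"market", "price", "share"}]
categories = ["sports", "sports", "finance", "finance"]

# Step 1: feature selection, keeping the terms with the highest information gain.
vocabulary = sorted(set().union(*pages))
ranked = sorted(vocabulary, key=lambda t: information_gain(pages, categories, t),
                reverse=True)
selected = ranked[:4]  # the cut-off of 4 is an arbitrary illustrative choice

# Step 2: dimensionality reduction with Quick Reduct on the selected terms.
print(quick_reduct(pages, categories, selected))
```

In this toy data the term "team" appears in three of the four pages and carries little information about the category, so it ranks below terms such as "match" or "market", which separate the two categories cleanly.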