The World Wide Web (WWW) is a vast repository of information in the form of web documents. The volume of information stored on the web is growing rapidly, and people rely increasingly on the Internet to acquire information. Internet World Stats reports that world Internet usage grew by 480 % between 2000 and 2011. This exponential growth has made it difficult to organize data on the web and to find it. If data on the Internet were categorized, relevant pieces of information could be found quickly and conveniently. Popular web directory projects such as the Yahoo Directory and the Mozilla directory organize web pages by category. According to a recent survey, about 584 million websites are currently hosted on the Internet, yet these directories list only a tiny fraction of them. Their careful classification is what has made these directories popular among web users. However, such directories rely on human effort to classify web pages, and only about 2.5 % of available web pages are included in them. The rapid growth of the web has made manual classification increasingly impractical, since manual or semi-automatic classification of websites is tedious and costly. For this reason, web page classification using machine learning algorithms has become a major research topic. A number of algorithms have been proposed for classifying websites by analyzing their features. In this paper we introduce a fast, effective, and accurate probabilistic classification model, based on machine learning and data mining techniques, for the automated classification of web pages into different categories according to their textual content.

Keywords: Content classification · Machine learning · Naïve Bayesian · Web mining · Probabilistic models · Web-page classification

Introduction

The World Wide Web (WWW) started in 1991 and has grown rapidly over the last two decades. It is estimated that today more than 3.5 billion people use the Internet, and that number is rising quickly. Internet World Statistics reports that world Internet usage grew by 480.4 % during the period 2000-2011 [1]. It has also been observed that more than 35 % of the data available worldwide is stored on the Internet. According to Netcraft's January 2012 survey, more than 584 million websites exist on the Internet, of which nearly 175.2 million are active. Today, several different tools are available to help an average Internet user locate and identify relevant information on the Internet. These tools can be broadly classified as (1) crawler-based search engines (SEs), e.g., Google, Bing, Yahoo, DuckDuckGo; (2) meta search engines, e.g., Metacrawler, Clusty; and (3) subject directories such as DMOZ (Directory Mozilla) and the Librarians' Internet Index (LII). The c...
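Since this excerpt does not show the paper's actual feature extraction or training pipeline, the following is only a minimal sketch of the class of model the abstract names: a multinomial Naïve Bayes classifier over the textual content of pages, which scores each category by P(category | words) under a word-independence assumption. The categories, training snippets, and tf-idf preprocessing below are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of multinomial Naive Bayes text classification.
# Assumptions: toy categories/snippets stand in for extracted web-page
# text; tf-idf bag-of-words features are one common choice, not
# necessarily the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled page text (in practice, stripped HTML content).
train_texts = [
    "latest football scores league match results",
    "stock market shares investors quarterly earnings",
    "new smartphone processor benchmark review",
    "team wins championship final goal",
    "bank interest rates inflation economy",
    "open source software release update",
]
train_labels = ["sports", "finance", "tech", "sports", "finance", "tech"]

# Vectorize text, then fit Naive Bayes, which estimates per-category
# word likelihoods and applies Bayes' rule at prediction time.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["quarterly profits beat market expectations"]))
# -> ['finance'] on this toy data
```

The appeal of this family of models, consistent with the abstract's "fast, effective" claim, is that training and prediction reduce to counting word occurrences per category, so they scale well to large document collections.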