Abstract: Web mining is the application of data mining techniques to extract and process knowledge from sources such as web documents, hyperlinks, and website usage logs. Given the explosive growth of information available through web resources, mining semantically relevant data for a search keyword is an active research problem. This work deals with automatically extracting information from web documents using web content mining. The extracted data is first preprocessed to obtain a format suitable for further analysis. A content-based search algorithm is then used to find the items relevant to the searched keyword, resulting in an indexed set of similar results. Next, similar data is clustered using the quality threshold clustering algorithm, which assigns a similarity index to each result item. To the final list of items obtained, the weighted page ranking algorithm is applied to rank the most frequently searched items. The efficiency of the proposed work is determined by the quality of the clusters and the ranking efficiency of the query blocks. Metrics such as cluster purity, normalized mutual information (NMI), Rand Index, F-Measure, and wPRF are used to evaluate the quality and ranking efficiency of the search results obtained. Duplicate result sets are handled and penalized to give less ambiguous results. The results obtained are shown to outperform existing approaches such as QFI and QFJ. The quality of the result set is further evaluated by repeating the process while considering only the top n ranked items, shuffling the top items, or randomly selecting items. This makes it possible to validate the results of the proposed work and confirm that it surpasses the existing algorithms.
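The clustering and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity index or the features used, so this example assumes Euclidean distance over toy 2-D coordinates, and the names `qt_cluster`, `purity`, and `max_diameter` are hypothetical. It follows the general quality threshold idea (grow a candidate cluster around every point while its diameter stays under a threshold, keep the largest candidate, remove its points, repeat) and then scores the result with cluster purity.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def _farthest(points, i, members):
    """Distance from point i to its farthest member of the candidate cluster."""
    return max(dist(points[i], points[j]) for j in members)


def qt_cluster(points, max_diameter):
    """Quality threshold clustering sketch (Euclidean distance is an assumption).

    points: list of coordinate tuples; returns clusters as lists of indices.
    """
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                # Pick the point that enlarges the candidate's diameter least.
                nxt = min(others, key=lambda i: _farthest(points, i, candidate))
                if _farthest(points, nxt, candidate) > max_diameter:
                    break  # adding any further point would exceed the threshold
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        clusters.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return clusters


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    majority = sum(
        Counter(labels[i] for i in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return majority / len(labels)


# Toy data (hypothetical): two tight groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
labels = ["a", "a", "b", "b", "b", "c"]

clusters = qt_cluster(points, max_diameter=1.0)
print(clusters)                   # -> [[0, 1, 2], [3, 4], [5]]
print(purity(clusters, labels))   # 5 of 6 items match their cluster's majority label
```

Unlike k-means, this approach does not need the number of clusters up front; the diameter threshold alone controls how tight each cluster must be, at the cost of a higher running time on large result sets.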