Abstract: Web mining is the application of data mining techniques to extract and process knowledge from sources such as web documents, hyperlinks, and website usage logs. Given the explosion of information available through web resources, mining semantically relevant data for a search keyword is an actively studied problem. This work deals with automatically extracting information from web documents using web content mining. The extracted data is preprocessed to obtain a format suitable for further analysis. A content-based search algorithm is used to find the items relevant to the searched keyword, producing an indexed set of similar results. Similar data is then clustered using the quality threshold clustering algorithm, which assigns a similarity index to each result item. The weighted page ranking algorithm is applied to the final list of items to rank the most frequently searched items in the list. The efficiency of the proposed work is determined by the quality of the clusters and the ranking efficiency of the query blocks. Metrics such as cluster purity, NMI, Rand Index, F-Measure, and wPRF are used to evaluate the quality and the ranking efficiency of the search results obtained. Duplicate result sets are detected and penalized to yield less ambiguous results. The results obtained are shown to outperform existing approaches such as QFI and QFJ. The quality of the result set is further evaluated by repeating the process using only the top n ranked items, shuffling the top items, or selecting items at random. These additional experiments validate the results of the proposed work and support the claim that it surpasses the existing algorithms.
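The clustering step mentioned above can be illustrated with a minimal sketch of quality threshold clustering. The abstract does not specify the distance measure, the threshold value, or the item representation, so everything below is an assumption made for illustration: result items are modeled as sets of tokens, the distance is Jaccard distance, and the names `jaccard_distance`, `qt_cluster`, and `max_diameter` are hypothetical rather than taken from the proposed work.

```python
def jaccard_distance(a, b):
    """Distance between two token sets: 1 - |a & b| / |a | b| (assumed measure)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def _farthest(items, members, j, distance):
    """Largest distance from item j to any current member of the candidate."""
    return max(distance(items[i], items[j]) for i in members)


def qt_cluster(items, max_diameter, distance=jaccard_distance):
    """Greedy quality threshold clustering; returns clusters as lists of indices."""
    remaining = set(range(len(items)))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster from every remaining item.
        for seed in sorted(remaining):
            candidate = [seed]
            others = remaining - {seed}
            while others:
                # Choose the item whose farthest distance to the members is smallest.
                nxt = min(sorted(others),
                          key=lambda j: _farthest(items, candidate, j, distance))
                # Stop once adding it would push the diameter past the threshold.
                if _farthest(items, candidate, nxt, distance) > max_diameter:
                    break
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        # Keep the largest candidate, remove its items, and repeat on the rest.
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


# Illustrative token sets standing in for preprocessed result items.
docs = [
    {"web", "mining", "data"},
    {"web", "mining", "logs"},
    {"page", "rank", "link"},
    {"page", "rank", "weight"},
    {"cluster", "purity"},
]
print(qt_cluster(docs, max_diameter=0.6))
```

Unlike k-means, this procedure does not need the number of clusters in advance; the diameter threshold bounds how dissimilar two items in the same cluster may be, at the cost of roughly cubic running time in the number of items.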