The internet contains an abundance of duplicated web documents. For example, two documents may be nearly identical except for a very small portion, such as URLs and advertisements. Such differences are unimportant for web search, yet the duplicates they produce degrade search results. If web crawlers could check how much of a newly crawled page duplicates previously crawled pages, the quality of web search would increase significantly. The main objective of this research is to propose a method that checks the duplication ratio between the content of a newly crawled page and content already crawled. The solution calculates this ratio during the crawl itself, as part of the web crawling algorithm. To achieve this, Charikar's SimHash fingerprinting technique is used as the basis for a new technique that detects exact and near duplicates by checking the duplication ratio of each newly crawled page. The experiment was carried out on multiple pages of two major B2B websites, Alibaba and TradeKey. More than 300 pages from two similar categories on each portal were selected. These pages were first evaluated with a third-party duplication detection tool to set the benchmark. The results were promising and close to the benchmark, and the system's running time was very short. The results deviated from the benchmark by about 10% on average, which is acceptable in this case. Based on these results, Charikar's SimHash fingerprinting technique can be used effectively to detect exact and near duplication.
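To make the fingerprinting idea concrete, the following is a minimal sketch of a SimHash-style comparison in Python. It is not the authors' implementation: the 64-bit fingerprint width, whitespace tokenization, unweighted tokens, the MD5-based token hash, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def token_hash(token: str) -> int:
    """Map a token to a stable 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Compute a SimHash fingerprint over lowercased, whitespace-split tokens."""
    counts = [0] * FINGERPRINT_BITS
    for token in text.lower().split():
        h = token_hash(token)
        # Each token votes +1 or -1 on every bit position.
        for bit in range(FINGERPRINT_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:  # a positive vote total sets the bit
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int) -> float:
    """Fraction of matching bits, used here as a rough duplication ratio."""
    return 1.0 - hamming_distance(a, b) / FINGERPRINT_BITS


if __name__ == "__main__":
    # Hypothetical page texts that differ only in a short trailing portion.
    page_a = "steel pipe supplier wholesale price contact seller today"
    page_b = "steel pipe supplier wholesale price contact seller now"
    ratio = similarity(simhash(page_a), simhash(page_b))
    print(f"estimated duplication ratio: {ratio:.2f}")
```

In a crawler, the fingerprint of each newly fetched page would be compared against stored fingerprints, and a page would be flagged as a near duplicate when the ratio exceeds a chosen threshold. That threshold is a tuning decision the abstract does not report.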