The study on Detecting Near-Duplicate WebPages

Cao, Yujuan; Niu, Zhendong; Wang, Weiqiang; Zhao, Kun

doi:10.1109/cit.2008.4594656

Cited by 1 publication

(3 citation statements)

References 18 publications


“…However in practice there probably still are some noises left. For example, "[1] [2] [3] [我来说两句]" ("[Let me say a few words]"), "相关专题：法治在线节目实录" ("Related topics: 法治在线 program transcripts") etc. These often occurred in the beginning of the texts; second, some of the ads and comments, which were added to the duplicated web pages, might not be removed completely during the purification.…”

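Section: […] This method treats web page content as a character stream and, using certain punctuation marks and common Chinese characters as anchor points, extracts text from the content as the page's feature code.

mentioning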

confidence: 99%

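“…The long text is: 在神七科研攻关中，哈尔滨工业大学承担着宇航员舱外宇航服的地面实验系统，学院气动技术中心课题组负责"水平舱环控系统改造"和"紧急复压系统"的研制。……而在神七发射前 38 分钟，上海的一名科研人员拔掉地面连接神七的最后一个重要插座，成为最后一位撤离发射架的人。 (In the Shenzhou-7 research effort, Harbin Institute of Technology was responsible for the ground test system for the astronauts' extravehicular spacesuit, and the school's pneumatic technology center team developed the "horizontal cabin environmental control system retrofit" and the "emergency repressurization system". …… And 38 minutes before the Shenzhou-7 launch, a researcher from Shanghai unplugged the last important connector linking the ground to Shenzhou-7, becoming the last person to leave the launch tower.) 3 Experimental results of the extracted feature codes are listed in TABLE I. S_A'A is the repeatability of the benchmark over the testing documents and S_AA' is the repeatability of the testing documents over the benchmark.…”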

Section: A Evaluation On the Noise-tolerance Ability Of The Feature

unclassified

“…Different kinds of approaches have been proposed for duplicates detection and elimination in recent years [2][3][4][5][6]. However most of them do not focus on noisy and fuzzy duplicates elimination.…”

Section: Introduction

mentioning

confidence: 99%


A Length-variable Feature Code Based Fuzzy Duplicates Elimination Approach for Large Scale Chinese WebPages

Guo, Chen, Cong et al. 2012

JSW


Most existing Chinese webpage duplicate elimination approaches do not focus on eliminating noisy and fuzzy duplicates. In this paper, we propose an efficient and noise-tolerant Chinese webpage duplicate elimination approach based on a Length-variable Feature Code. First, an Independent Extraction Unit is defined to eliminate the impact of short paragraphs on feature code extraction. Then the concept of repeatability is introduced, using the longest common substring to enhance noise tolerance. Experimental results on a dataset of 10 million web pages show that the proposed approach can efficiently handle duplicates in massive web page collections, with a duplicate elimination precision of 99.03%.
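The abstract does not give the formula for repeatability, so the following is only a minimal sketch of one plausible reading: the repeatability of feature code A over feature code B is assumed to be the length of their longest common substring divided by the length of A. The function names `longest_common_substring` and `repeatability` are hypothetical and are not taken from the paper.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def repeatability(code_a: str, code_b: str) -> float:
    """Assumed definition: share of code_a covered by its longest
    common substring with code_b (not confirmed by the source)."""
    if not code_a:
        return 0.0
    return len(longest_common_substring(code_a, code_b)) / len(code_a)
```

Under this assumed definition the measure is asymmetric, which is consistent with the citation statement above reporting two separate values (benchmark over testing documents, and testing documents over benchmark). Leading noise in one copy lowers that copy's score but leaves the other copy's score high, so a threshold on the larger of the two values would tolerate such noise.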
