2008 8th IEEE International Conference on Computer and Information Technology 2008
DOI: 10.1109/cit.2008.4594656

The study on Detecting Near-Duplicate WebPages

Abstract: Reprinting information among websites produces a great deal of redundant web pages. To improve search efficiency and user satisfaction, an algorithm to Detect near-Duplicate WebPages (DDW) is proposed. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we consider both syntactic and semantic information to present and compute documents' similarities. Second, after classifying web pages into different categories, we index fea…

Cited by 1 publication (3 citation statements)
References 18 publications
“…However in practice there probably still are some noises left. For example, "[1] [2] [3] [我来说两句]" ("Let me say a few words"), "相关专题:法治在线节目实录" ("Related topic: transcript of the Rule of Law Online program") etc. These often occurred in the beginning of the texts; second, some of the ads and comments, which were added to the duplicated web pages, might not be removed completely during the purification.…”
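Section: …method. This method treats web page content as a character stream and, using certain punctuation marks and common Chinese characters as anchors, extracts text from the page content to serve as the page's feature code.
mentioning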
confidence: 99%
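
The anchor-based extraction described in the section text above can be illustrated with a minimal sketch. This is not the cited paper's implementation: the anchor set, the number of characters taken after each anchor, and the function name `extract_feature_code` are all assumptions made for illustration.

```python
import re

# Hypothetical anchor set: two punctuation marks and one common Chinese character.
ANCHORS = ",。的"


def extract_feature_code(text: str, anchors: str = ANCHORS, width: int = 2) -> str:
    """Concatenate the `width` characters that follow each anchor occurrence.

    The page text is treated as a plain character stream; whitespace is
    dropped first so that layout differences do not change the result.
    """
    stream = re.sub(r"\s+", "", text)
    pieces = []
    for i, ch in enumerate(stream):
        if ch in anchors:
            # Take up to `width` characters right after the anchor.
            pieces.append(stream[i + 1 : i + 1 + width])
    return "".join(pieces)


page = "今天的天气很好,我们去公园。明天的计划未定"
noisy_page = "[我来说两句]" + page  # leading noise that contains no anchor

code = extract_feature_code(page)
noisy_code = extract_feature_code(noisy_page)
```

In this toy case the leading noise contains no anchor character, so both calls yield the same feature code. Noise that does contain an anchor would add characters to the code, which is why a similarity measure over feature codes, rather than exact equality, would still be needed.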
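“…The long text is: In the Shenzhou 7 research effort, Harbin Institute of Technology was responsible for the ground test system for the astronauts' extravehicular space suits, and a research group at the institute's Pneumatic Technology Center was responsible for developing the "horizontal cabin environmental control system retrofit" and the "emergency repressurization system". … And 38 minutes before the launch of Shenzhou 7, a researcher from Shanghai unplugged the last important connector linking the ground to Shenzhou 7, becoming the last person to leave the launch tower. Experimental results of the extracted feature codes are listed in TABLE I. $S_{A'A}$ is the repeatability of the benchmark over the testing documents and $S_{AA'}$ is the repeatability of the testing documents over the benchmark.…”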
Section: A Evaluation On the Noise-tolerance Ability Of The Feature
unclassified