2012
DOI: 10.4304/jsw.7.11.262-2629

A Length-variable Feature Code Based Fuzzy Duplicates Elimination Approach for Large Scale Chinese WebPages

Abstract:

Most existing Chinese webpage duplicate elimination approaches do not focus on eliminating noisy and fuzzy duplicates. In this paper, we propose an efficient and noise-tolerant Chinese webpage duplicate elimination approach based on a Length-variable Feature Code. First, an Independent Extraction Unit is defined to eliminate the impact of short paragraphs on feature code extraction. Then the concept of repeatability is introduced by using the longest common substring to enhance the noi…
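The abstract is truncated before it defines repeatability, so the following is only a minimal sketch of the one building block it names, the longest common substring. The function names and the `overlap_ratio` score are hypothetical illustrations, not the paper's definitions.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b.

    Classic dynamic programming: curr[j] holds the length of the common
    suffix of a[:i] and b[:j]. Runs in O(len(a) * len(b)) time.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def overlap_ratio(a: str, b: str) -> float:
    """Hypothetical score (an assumption, not from the paper):
    length of the longest common substring over the shorter text's length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return len(longest_common_substring(a, b)) / shorter
```

Because the comparison is character-based, it works on Chinese text without word segmentation; two paragraphs that differ only by a noisy prefix or suffix still share a long common substring.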

Cited by 3 publications (2 citation statements)
References 8 publications
“…Social networks are developing rapidly as Internet use spreads: the number of users is growing explosively, and the forms of service are changing quickly. In recent years, many new social networking services have emerged, and weak-relationship social network services, represented by Sina Weibo in China and Facebook abroad, are becoming a major form of social network [1][2][3]. Unlike traditional social networks, the nodes of a social network built on weak (that is, one-way) relationships are clearly heterogeneous because those relationships are unidirectional: they include a large number of user nodes, whose principal is a natural person (e.g., "zhangsan"), and theme nodes, whose principal is a media outlet, institution, or other source (e.g., "the weather of Beijing", "south weekend", "popular video", etc.). User nodes usually act as message subscribers that follow a large number of topic nodes one-way, and this one-way subscription relationship is often based on the user's interest in different types of themes; at the same time, user nodes often form two-way relationships with other users, usually based on the user's real social relations [4][5][6].…”
Section: Introduction (mentioning)
confidence: 99%
“…Deduplication can identify redundant data, eliminate all but one copy, and create local pointers to the information that users can access. This technology has attracted widespread attention from industry and academia [2,3,4,5].…”
Section: Introduction (mentioning)
confidence: 99%
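The deduplication steps described in the last statement (identify redundant data, keep one copy, point every reference at it) can be sketched as follows. This is a minimal illustration that assumes exact-match duplicates identified by a SHA-256 content hash; the `deduplicate` function and its data layout are hypothetical, and it does not handle the fuzzy duplicates that the cited paper targets.

```python
import hashlib


def deduplicate(documents: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Keep one copy of each distinct content and map every name to it.

    Returns (store, pointers):
      store    maps a content digest to the single retained copy
      pointers maps each document name to the digest of its content
    """
    store: dict[str, str] = {}
    pointers: dict[str, str] = {}
    for name, content in documents.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        store.setdefault(digest, content)  # first copy wins, later ones are dropped
        pointers[name] = digest            # every name can still reach its content
    return store, pointers


if __name__ == "__main__":
    docs = {"page1": "same text", "page2": "same text", "page3": "other text"}
    store, pointers = deduplicate(docs)
    print(len(store), "unique copies for", len(pointers), "documents")
```

Looking up `store[pointers[name]]` returns the original content for any document name, so removing the redundant copies loses no information.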