2008
DOI: 10.1145/1326561.1326564

Tracking Web spam with HTML style similarities

Abstract: Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web authoring software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories). Those pages built using the same generating method (template or script) share a common "look and feel" that is not easily detected by common text classification methods, but is more related to stylometry. In this work we…

Cited by 83 publications (74 citation statements)
References 19 publications
“…Due to the same independent hash values in the filled space, the error of the b-bit scheme corresponds to the error of the estimate of |S1∆S2|. Inaccuracy in just a few bit positions in the white space will yield a large relative error of the estimate of J. The scheme has been applied successfully in a variety of applications, including similarity search [4,5,6], association rule learning [8], compressing social networks [7], advertising diversification [11], tracking Web spam [21], web duplicate detection [15], large-scale learning [16], and more [1,3,14].…”
Section: Minwise Hashing (mentioning)
confidence: 99%
“…When referring to web content, structural features include the usage of HTML-encoded text, which includes the ability to format text with word bolding, italics, font coloring, font size, etc. [15]. Syntactic features refer to the sentence level of a document, including patterns used for formulating sentences.…”
Section: Stylometric and Text Analysis (mentioning)
confidence: 99%
“…Invisible text usually means text in the same color as the background or text in layers which are behind the normal text or which are invisible. Again, much research has been performed to identify content spam, among others [13][14][15].…”
Section: Related Work (mentioning)
confidence: 99%