2010
DOI: 10.1007/s10994-010-5171-1

Graph regularization methods for Web spam detection

Abstract: We present an algorithm, WITCH, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph as well as page contents and features. The method is efficient, scalable, and provides state-of-the-art accuracy on a standard Web spam benchmark.

Cited by 51 publications
(26 citation statements)
References 18 publications
“…We extend an approach first popularized in the web-spam detection domain [1] to the images linked to web pages. For each image we calculate pixel-based and text-based features (which are concatenated into a vector x_i) and take into account an image's position in the web graph (based on the directed edges E).…”
Section: Approach
type: mentioning
confidence: 99%
“…We exploit the web graph, or hyperlink information, by doing graph regularization to constrain the predicted scores to vary smoothly between the linked pages. We extend a web-page spam detection approach [1] to predict the image score in a single optimization framework.…”
Section: Introduction
type: mentioning
confidence: 99%
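The idea in the statement above, scores constrained to vary smoothly across linked pages, can be illustrated with a minimal sketch. It assumes a linear scorer f_i = w · x_i, a squared-error fit on the labeled nodes, and a squared-difference penalty on each edge in E, weighted by a hypothetical parameter `gamma`. This is a generic graph-regularization objective chosen for illustration, not the exact objective of WITCH or of the citing paper; all function names and the toy data are assumptions.

```python
def predict(w, x):
    """Linear score f = w . x for one node's feature vector."""
    return sum(wk * xk for wk, xk in zip(w, x))


def objective(w, X, labels, edges, gamma):
    """Squared-error fit on labeled nodes plus a smoothness penalty on edges."""
    f = [predict(w, x) for x in X]
    fit_term = sum((f[i] - y) ** 2 for i, y in labels.items())
    smooth_term = sum((f[i] - f[j]) ** 2 for i, j in edges)
    return fit_term + gamma * smooth_term


def train(X, labels, edges, gamma=1.0, lr=0.01, steps=2000):
    """Plain gradient descent on the objective above (illustrative only)."""
    dim = len(X[0])
    w = [0.0] * dim
    for _ in range(steps):
        f = [predict(w, x) for x in X]
        grad = [0.0] * dim
        # Gradient of the fit term: 2 * (f_i - y_i) * x_i
        for i, y in labels.items():
            r = 2.0 * (f[i] - y)
            for k in range(dim):
                grad[k] += r * X[i][k]
        # Gradient of the smoothness term: 2 * gamma * (f_i - f_j) * (x_i - x_j)
        for i, j in edges:
            r = 2.0 * gamma * (f[i] - f[j])
            for k in range(dim):
                grad[k] += r * (X[i][k] - X[j][k])
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


# Toy data (assumed): four nodes, features = [bias, one content feature].
X = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.8], [1.0, 1.0]]
labels = {0: -1.0, 3: 1.0}   # node 0 is non-spam (-1), node 3 is spam (+1)
edges = [(0, 1), (2, 3)]     # directed links between nodes

w_plain = train(X, labels, edges, gamma=0.0)   # no graph regularization
w_smooth = train(X, labels, edges, gamma=5.0)  # linked scores pulled together
```

Comparing `w_plain` and `w_smooth` on this toy graph shows the intended effect: a larger `gamma` shrinks the score gap between linked nodes at the cost of a looser fit to the labels.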
“…al., [13] modified the HITS algorithm to detect spam pages based on link cheating. Jacob [14] detected spam pages using graph regularization over the network. The third cheating method is hiding: web content such as keywords is set to the same color as the background, so the page content is "lost" against it.…”
Section: Introduction
type: mentioning
confidence: 99%
“…To combat this problem, several methods have been proposed in the literature: some analyze only spam links [16,31], others perform only content analysis [26,36], and still others analyze both links and content [1,11,28]. Among these proposals, the most successful have been machine learning techniques, such as ensemble selection [15,17], clustering [11,22], random forests [15], boosting [15,17], support vector machines [29,35], and decision trees [11,16].…”
Section: Introduction
type: unclassified