Looking into the past to better classify web spam

Dai, Na; Davison, Brian D.; Qi, Xiaoguang

doi:10.1145/1531914.1531916

Cited by 35 publications

(23 citation statements)

References 14 publications


“…To our knowledge, there has been no prior work discussing the re-use of closed banking websites. However, several researchers have observed that spammers sometimes re-register expired domains in order to benefit from the reputation of the old domain [2,3,4]. For instance, Hao et al found that spammers quickly register recently expired domains, much faster than non-spammers [4].…”

Section: Related Work (mentioning)

confidence: 99%

The Ghosts of Banking Past: Empirical Analysis of Closed Bank Websites

Moore

Clayton

2014

Lecture Notes in Computer Science


Abstract. We study what happens to the domains used by US banks for their customer-facing websites when the bank is shut down or merges with another institution. The Federal Deposit Insurance Corporation (FDIC) publishes detailed statistical data about the many thousands of US banks, including their website URLs. We extracted details of the 3 181 banks that have closed their doors since 2003 and determined the fate of 2 302 domain names they are known to have used. We found that 47% are still owned by a banking institution but that 33% have passed into the hands of people who are exploiting the residual good reputation attached to the domain by hosting adverts, distributing malware or carrying out search engine optimization (SEO) activities. We map out the lifecycle of domain usage after the original institution no longer requires it as their main customer contact point, and explain our findings from an economic perspective. We present logistic regressions that help explain some of the reasons why closed bank domains are let go, as well as why others choose to repurpose them. For instance, we find that smaller and troubled banks are more likely to lose control of their domains, and that the domains from bigger banks are more likely to be repurposed by others. We draw attention to other classes of domain that are best kept off the open market lest old botnets be revivified or other forms of criminality be resurrected. We end by exploring what the public policy options might be that would protect us all from ghost domains that are no longer being looked after by their original registrants.


“…New features based on time series [10,7,8] as well as normalization methods across different snapshots and TLDs are the expected outcome of the proposed tasks.…”

Section: Existing and Expected Filtering Technologies (mentioning)

confidence: 99%

“…As an alternate solution, only feature sets such as "public" [5] can be made available; in this case a precompiled set of content change features based e.g. on [7] should also be compiled.…”

Section: Open Questions (mentioning)

confidence: 99%

“…We describe new training and testing scenarios. New features may be generated by considering the temporal change of several crawl snapshots of the same domain [10,7,8]. In addition by the needs of collaboration across different archival institutions we may also provide training labels over one TLD and request prediction over a fully or partly unlabeled different domain.…”

Section: Introduction (mentioning)

confidence: 99%


Web spam challenge proposal for filtering in archives

Benczúr

Erdélyi

Masanès

et al. 2009

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web


In this paper we propose new tasks for a possible future Web Spam Challenge motivated by the needs of the archival community. The Web archival community consists of several relatively small institutions that operate independently and possibly over different top level domains (TLDs). Each of them may have a large set of historic crawls. Efficient filtering would hence require (1) enhanced use of the time series of domain snapshots and (2) collaboration by transferring models across different TLDs. Corresponding Challenge tasks could hence include the distribution of crawl snapshot data for feature generation as well as classification of unlabeled new crawls of the same or even different TLDs.


“…Wu and Davison [23] expand from a seed set of spam pages to the neighbors to find more suspicious pages in the web graph. Dai et al [5] exploit the historical content information of web pages to improve spam classification, while Chung et al [4] propose to use time series to study the link farm evolution. Martinez-Romo and Araujo [18] apply a language model approach to improve web spam identification.…”

Section: Introduction (mentioning)

confidence: 99%

Web Spam Detection Using Link-Based Ant Colony Optimization

Taweesiriwate

Manaskasemsak

Rungsawang

2012

2012 IEEE 26th International Conference on Advanced Information Networking and Applications


Web spam is one of the most important problems degrading the quality and efficiency of web search engines. In this paper, we present a novel link-based ant colony optimization learning algorithm for spam host detection. The host graph is first constructed by aggregating pages' hyperlink structure. Following the TrustRank assumption, ants start walking from a normal host and randomly follow host links with a probability distribution. Then, the classification rules are generated according to common features of normal hosts sequentially discovered by ants. In experiments with the WEBSPAM-UK2006 dataset, the proposed learning model provides higher accuracy in classifying both normal and spam hosts than several baselines, including a state-of-the-art C4.5. Moreover, we also provide an analysis of parameter tuning for better results.

