By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
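The two-step structure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the term-set comparison, the use of a second crawler copy as a baseline for ordinary page churn, the `min_extra` threshold, the page dictionary keys, and the `classify` callable are all assumptions introduced here.

```python
import re

def terms(text):
    """Lowercased set of alphanumeric tokens in a page's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_candidate(crawler_1, crawler_2, browser, min_extra=1):
    """Filtering step (hypothetical rule): flag a page when the crawler and
    browser copies differ by more terms than two crawler copies differ from
    each other, so that ordinary dynamic content is not flagged."""
    c1, c2, b = terms(crawler_1), terms(crawler_2), terms(browser)
    cross = len(c1 ^ b)       # terms that differ between crawler and browser copies
    baseline = len(c1 ^ c2)   # terms that differ between two crawler copies
    return cross - baseline >= min_extra

def detect(pages, classify):
    """Second step: run a classifier (any callable returning True/False,
    assumed to be trained elsewhere) over the filtered candidates only."""
    candidates = [
        p for p in pages
        if is_candidate(p["crawler_1"], p["crawler_2"], p["browser"])
    ]
    return [p["url"] for p in candidates if classify(p)]
```

The filter is deliberately cheap so that the more expensive classifier only sees pages whose copies differ beyond normal churn.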
The back-projection method is a powerful technique for investigating the time-dependent earthquake source, but its physical interpretation is elusive. We investigate how earthquake rupture heterogeneity and directivity can affect back-projection results (imaged location and beam power) using synthetic earthquake models. Rather than attempting to model the dynamics of any specific real earthquake, we use idealized kinematic rupture models, with constant or varying rupture velocity, peak slip rate, and fault-local strike orientation along unilateral or bilateral rupturing faults, and perform back-projection with the resultant synthetic seismograms. Our experiments show back-projection can track only heterogeneous rupture processes; homogeneous rupture is not resolved in our synthetic experiments. The amplitude of beam power does not necessarily correlate with the amplitude of any specific rupture parameter (e.g., slip rate or rupture velocity) at the back-projected location. Rather, it depends on the spatial heterogeneity around the back-projected rupture front, and is affected by the rupture directivity. A shorter characteristic wavelength of the source heterogeneity, or rupture directivity toward the array, results in strong beam power at higher frequencies. We derive an equation based on Doppler theory to relate the wavelength of heterogeneity to synthetic seismogram frequency. This theoretical relation explains the frequency- and array-dependent back-projection results not only in our synthetic experiments but also in our analysis of the 2019 M7.6 bilaterally rupturing New Ireland earthquake. Our study provides a novel perspective for physically interpreting back-projection results and retrieving information about earthquake rupture characteristics.
Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link-based ranking algorithms such as PageRank, HITS, and other derivatives are especially vulnerable to link spam. Link farms and link exchanges are two common instances of link spam that produce spam communities, i.e., clusters in the web graph. In this paper, we present a directed approach to extracting link spam communities when given one or more members of the community. In contrast to previous completely automated approaches to finding link spam, our method is specifically designed to be used interactively. Our approach starts with a small spam seed set provided by the user and simulates a random walk on the web graph. The random walk is biased to explore the local neighborhood around the seed set through the use of decay probabilities. Truncation is used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12% precision in extracting large link farms and 80.46% precision in extracting link exchange centroids.
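The seed-biased walk with decay and truncation can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's algorithm: the adjacency-list graph representation, the uniform start over seeds, the single `decay` factor applied at every step, the top-`keep` truncation rule, and the accumulation of mass across steps as the final score are all choices made here for the example.

```python
from collections import defaultdict

def decayed_truncated_walk(graph, seeds, decay=0.85, steps=10, keep=100):
    """Rank nodes near `seeds` by simulated random-walk mass.

    graph: dict mapping each node to a list of its out-neighbors (assumed format).
    decay: fraction of mass that survives each step, which biases the walk
           toward the local neighborhood of the seed set.
    keep:  number of nodes retained after each step (truncation).
    """
    current = {s: 1.0 / len(seeds) for s in seeds}  # uniform start on the seeds
    total = defaultdict(float)
    for node, p in current.items():
        total[node] += p

    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in current.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue  # mass at dead ends is dropped
            share = decay * p / len(neighbors)
            for n in neighbors:
                nxt[n] += share
        # Truncation: keep only the most heavily visited nodes.
        top = sorted(nxt.items(), key=lambda kv: (-kv[1], str(kv[0])))[:keep]
        current = dict(top)
        for node, p in current.items():
            total[node] += p

    # Decreasing order of accumulated mass, ties broken by node name.
    return sorted(total.items(), key=lambda kv: (-kv[1], str(kv[0])))
```

In an interactive setting, the head of the returned list would be shown to the user, who could confirm or reject nodes and rerun the walk with an enlarged seed set.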