Web bots are used to automate client interactions with websites, which facilitates large-scale web measurements. However, websites may employ web bot detection. When they do, their response to a bot may differ from their response to a regular browser. Such discrimination can result in deviating content, restricted resources, or even exclusion of the bot from a website. This places strict restrictions upon studies: the more bot detection takes place, the more results must be manually verified to confirm the bot's findings. To investigate the extent to which bot detection occurs, we reverse-analysed commercial bot detection. We found that bot detection relies in part on the values of browser properties and the presence of certain objects in the browser's DOM. This part strongly resembles browser fingerprinting. We leveraged this for a generic approach to detecting web bot detection: we identify which part of a web bot's browser fingerprint uniquely identifies it as a web bot by contrasting its fingerprint with those of regular browsers. This leads to the fingerprint surface of a web bot. Any website accessing the fingerprint surface is accessing a part unique to bots, and thus engaging in bot detection. We provide a characterisation of the fingerprint surface of 14 web bots. We show that the vast majority of these frameworks are uniquely identifiable through well-known fingerprinting techniques. We design a scanner to detect web bot detection based on the reverse analysis, augmented with the found fingerprint surfaces. In a scan of the Alexa Top 1 Million, we find that 12.8% of websites show indications of web bot detection.
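The fingerprint-surface idea described above can be sketched as a set difference: the properties on which a bot's fingerprint deviates from every regular browser's fingerprint. The sketch below is a minimal illustration; the property names and values are hypothetical examples, not taken from the study.

```python
def fingerprint_surface(bot_fp, regular_fps):
    """Return the properties (with the bot's values) that distinguish
    the bot from every regular-browser fingerprint."""
    surface = {}
    for prop, value in bot_fp.items():
        # A property belongs to the surface only if no regular browser
        # ever exhibits the bot's value for it.
        if all(fp.get(prop) != value for fp in regular_fps):
            surface[prop] = value
    return surface

# Illustrative fingerprints (hypothetical property names/values)
bot = {"navigator.webdriver": True, "navigator.platform": "Linux x86_64"}
chrome = {"navigator.webdriver": False, "navigator.platform": "Linux x86_64"}
firefox = {"navigator.webdriver": False, "navigator.platform": "Win32"}

print(fingerprint_surface(bot, [chrome, firefox]))
# {'navigator.webdriver': True}
```

A website's script that reads a property in this surface is, by the paper's definition, touching a part of the fingerprint unique to bots.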
To gauge adoption of web security measures, large-scale testing of website security is needed. However, the diversity of modern websites makes a structured approach to testing a daunting task. This is especially a problem with respect to logging in: there are many subtle deviations in the flow of the login process between websites. Current efforts investigating login security are typically semi-automated, requiring manual intervention that does not scale well. Hence, comprehensive studies of post-login areas have not been possible yet. In this paper, we introduce Shepherd, a generic framework for logging in on websites. Given credentials, it provides a fully automated attempt at logging in. We discuss various design challenges related to automatically identifying login areas, validating correct logins, and detecting incorrect credentials. The tool collects data on successes and failures for each of these. We evaluate Shepherd's ability to log in on thousands of sites, using unreliable, legitimately crowd-sourced credentials for a random selection from the Alexa Top websites list. Notwithstanding parked domains, invalid credentials, etc., Shepherd was able to automatically log in on 7,113 sites from this set, an order of magnitude beyond previous efforts at automating login.
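One of the design challenges named above, automatically identifying login areas, can be approximated with a simple structural heuristic: a form that contains both a username/email field and a password field. This is a hedged sketch of one plausible heuristic, not Shepherd's actual implementation.

```python
from html.parser import HTMLParser

class LoginFormFinder(HTMLParser):
    """Flag a page as containing a login form if some <form> holds both
    a text/email input and a password input (a common heuristic)."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_user = self.has_pass = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.in_form, self.has_user, self.has_pass = True, False, False
        elif tag == "input" and self.in_form:
            input_type = a.get("type", "text")
            if input_type in ("text", "email"):
                self.has_user = True
            elif input_type == "password":
                self.has_pass = True

    def handle_endtag(self, tag):
        if tag == "form":
            if self.has_user and self.has_pass:
                self.found = True
            self.in_form = False

page = '<form action="/login"><input type="email"><input type="password"></form>'
finder = LoginFormFinder()
finder.feed(page)
print(finder.found)  # True
```

In practice such a heuristic must handle JavaScript-rendered forms, multi-step login flows, and password fields that only appear after the username is submitted, which is why fully automated login is hard to scale.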
Automated browsers are widely used to study the web at scale. Their premise is that they measure what regular browsers would encounter on the web. In practice, deviations due to detection of automation have been found. To what extent automated browsers can be improved to reduce such deviations has so far not been investigated in detail. In this paper, we investigate this for a specific web automation framework: OpenWPM, a popular research framework specifically designed to study web privacy. We analyse (1) detectability of OpenWPM, (2) prevalence of OpenWPM detection, and (3) integrity of OpenWPM's data recording. Our analysis reveals OpenWPM is easily detectable. We measure to what extent fingerprint-based detection is already leveraged against OpenWPM clients on 100,000 sites and observe that it is commonly detected (∼14% of front pages). Moreover, we discover integrated routines in scripts to specifically detect OpenWPM clients. Our investigation of OpenWPM's data recording integrity identifies novel evasion techniques and previously unknown attacks against OpenWPM's instrumentation. We investigate and develop mitigations to address the identified issues. In conclusion, we find that reliability of automation frameworks should not be taken for granted. Identifiability of such frameworks should be studied, and mitigations deployed, to improve reliability.
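Measuring the prevalence of framework detection, as described above, amounts to scanning the scripts a site serves for routines that probe automation-specific artifacts. A minimal sketch of such a scanner follows; the indicator patterns are illustrative examples of well-known automation probes, not the paper's actual indicator list.

```python
import re

# Hypothetical indicator patterns; real detection scripts probe
# automation artifacts such as the navigator.webdriver flag or
# framework-specific globals left behind by instrumentation.
INDICATORS = [
    r"navigator\.webdriver",
    r"_Selenium",
    r"callPhantom",
]

def flags_bot_detection(script_source):
    """Return the indicator patterns a script matches
    (an empty list means no indication of bot detection)."""
    return [p for p in INDICATORS if re.search(p, script_source)]

snippet = "if (navigator.webdriver) { blockClient(); }"
print(flags_bot_detection(snippet))
```

A static pattern scan like this gives a lower bound: obfuscated or dynamically constructed detection routines require dynamic analysis to find.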