Strong encryption algorithms and reliable anonymity routing have made cybercrime investigation more challenging. Hence, one option for law enforcement agencies (LEAs) is to search through unencrypted content on the Internet or anonymous communication networks (ACNs). The capability to automatically harvest web content from web servers enables LEAs to collect and preserve data that may serve as leads, clues, or evidence in an investigation. Although scientific studies began exploring web crawling soon after the inception of the web, few have thoroughly scrutinised web crawling on the "dark web" or via ACNs such as I2P, IPFS, Freenet, and Tor. This paper presents a systematic literature review (SLR) that examines the prevalence and characteristics of dark web crawlers. From a selection of 58 peer-reviewed articles mentioning crawling and the dark web, 34 remained after excluding irrelevant articles. The review showed that most dark web crawlers were programmed in Python, using either Selenium or Scrapy as the web scraping library. The knowledge gathered from the SLR was used to integrate a Tor-based web crawling model into an existing software toolset customised for ACN-based investigations. Finally, the performance of the model was examined through a set of experiments. The results indicate that the developed crawler successfully scraped web content from both clear and dark web pages, including dark marketplaces on the Tor network. The scientific contribution of this paper is novel knowledge concerning ACN-based web crawlers, together with a model for crawling and scraping clear and dark websites for the purpose of digital investigations. The conclusions cover the practical implications of dark web content retrieval and archival, such as its use as investigative clues and evidence, and related future research topics.

INDEX TERMS: cybercrime, digital forensics, systematic literature review, dark web crawling, Tor
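The abstract names Python with Selenium or Scrapy as the dominant stack but does not reproduce the crawler itself. The snippet below is a minimal sketch of the fetch-and-scrape step such a crawler performs, using requests and BeautifulSoup rather than the Selenium or Scrapy stacks the review found most common, to keep the example short. It assumes a local Tor client exposing a SOCKS5 proxy on 127.0.0.1:9050, and the seed URL is a hypothetical placeholder.

```python
# Minimal sketch of a Tor-based fetch-and-scrape step in Python.
# Assumes a local Tor client exposing a SOCKS5 proxy on 127.0.0.1:9050
# and the packages requests[socks] and beautifulsoup4 installed.
import requests
from bs4 import BeautifulSoup

# Route traffic through Tor; "socks5h" makes the proxy resolve hostnames,
# which is required for .onion addresses.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch(url: str, timeout: int = 60) -> str:
    """Fetch a page over the Tor network and return its HTML."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
    response.raise_for_status()
    return response.text

def scrape_links(url: str) -> list[str]:
    """Extract hyperlinks from a page as candidates for further crawling."""
    soup = BeautifulSoup(fetch(url), "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    seed = "http://exampleonionaddress.onion"  # hypothetical placeholder seed
    for link in scrape_links(seed):
        print(link)
```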
The transformation of contemporary societies through digital technologies has had a profound effect on all human activities, including those in the realm of illegal, unlawful, and criminal deeds. Moreover, the affordances provided by anonymity-creating techniques such as the Tor protocol, which are beneficial for preserving civil liberties, appear to be highly profitable for various types of miscreants whose crimes range from human trafficking, arms trading, and child pornography to selling controlled substances and racketeering. Tor and similar technologies are the foundation of a vast, often mysterious, sometimes anecdotal, and occasionally dangerous space termed the Dark Web. Exploiting the features that make the Internet a uniquely generative knowledge agglomeration, borderless and spanning different jurisdictions, the Dark Web is a source of perpetual challenges for both national and international law enforcement agencies. The anonymity granted to the wrong people increases the complexity and cost of identifying both the crimes and the criminals, a problem often exacerbated by a lack of adequate human resources. Technologies such as machine learning and artificial intelligence come to the rescue through automation, intensive data harvesting, and analysis built into various types of web crawlers that explore and identify dark markets and the people behind them. Effective and efficient crawling requires a pool of dark sites, or onion URLs. This study presents a way to build such a crawling mechanism by extracting onion URLs from malicious executables: the executables are run in a sandbox environment and the resulting log files are analysed with machine learning algorithms. By discerning between malware that uses the Tor network and malware that does not, we were able to classify Tor-using malware with an accuracy of 91% using a logistic regression algorithm. The initial results suggest that this machine learning approach can be used to identify new malicious servers on the Tor network. Embedding such a mechanism into a crawler may also add predictability, and thus efficiency, to recognising dark market activities and, consequently, to their closure.
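The abstract reports 91% accuracy with logistic regression but does not detail the feature set or preprocessing. The sketch below is an illustrative reconstruction of the classification step only, assuming bag-of-words features over sandbox log text and scikit-learn; the log lines and labels are invented for demonstration and are not the study's data.

```python
# Illustrative reconstruction of the classification step, not the paper's
# actual pipeline: the real feature set is not given in the abstract.
# Sandbox log text is vectorised into token counts and a logistic
# regression model labels each sample as Tor-using (1) or not (0).
# Requires scikit-learn; the log lines and labels below are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical sandbox log dumps, one per executed malware sample.
logs = [
    "connect 127.0.0.1:9050 socks5 handshake circuit established",
    "dns lookup update.example.com http GET /payload",
    "fetched onion service descriptor rendezvous point established",
    "smtp connect port 25 relay spam payload sent",
]
labels = [1, 0, 1, 0]  # 1 = uses the Tor network, 0 = does not

# Bag-of-words features over log tokens (ports, protocol names, keywords).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(logs)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# Classify an unseen log; a Tor-like trace should be labelled 1.
new_log = "socks5 connect 127.0.0.1:9050 onion descriptor fetched"
print(clf.predict(vectorizer.transform([new_log])))
```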