In this thesis, we propose new algorithms, methods, and datasets that can be used to classify, mine information from, and rank web domains and similar text-based resources. Motivated by our joint work with INCIBE, we focus our efforts on detecting web resources whose content could indicate illegal activities. Most of these textual web pages are hosted on darknets, and for that reason we centered our analysis on The Onion Router (Tor) Darknet, based on the common belief that this network hosts plenty of criminal activities. Additionally, we addressed the same problem in Online Notepad Services (ONS), in particular the Pastebin service. Several of the contributions that we present here are already incorporated into tools developed by INCIBE that help Spanish Law Enforcement Agencies (LEAs) monitor the contents of the Tor Darknet.

Our work relies on the application of machine learning, both classical and deep, using supervised learning most of the time. This approach required the creation of different datasets. The first of them, named Darknet Usage Text Addresses (DUTA), contained 6,831 labeled samples distributed over 26 classes; we subsequently extended it to 10,367 samples, naming it DUTA-10K.

Using DUTA, we evaluated the combination of two text representation techniques with three well-known classifiers to categorize Tor domains. The combination of a TF-IDF word representation with Logistic Regression achieved a 93.7% macro F1 score on a subset of DUTA in which eight categories of illegal activities were selected. To classify Pastebin contents, we used Active Learning to select and label only the most informative samples, thereby reducing the cost of building a labeled dataset. Our design requires three cascaded classifiers, the last of which determines whether a sample belongs to one of six categories related to criminal activities, obtaining an average class recall of 95.24% in the binary setting and 80.33% in the multiclass setting.

To enrich the information that we provide to LEAs, we first developed a semi-automatic algorithm to identify emerging products in Tor marketplaces. Using Graph Theory, we build a Products Correlations Graph (PCG), in which the nodes are the markets' products and the edges reflect the simultaneous offering of two products in the same market. Our algorithm decomposes the PCG using the k-shell algorithm and analyzes the connectivity of the products in the core shell. We applied this method to drug Hidden Services (HS) in DUTA, finding that MDMA and Ecstasy were the most emerging drug products during the analyzed period. Second, we used Named Entity Recognition (NER) to recognize rare and emerging named entities in noisy user-generated text. We overcome the need for gazetteers when incorporating external resources into neural network architectures, presenting a novel feature that we named Local Distance Neighbor (LDN), and in this way obtaining state-of-the-art F1 scores on three categories of the W-NUT-2017 dataset: Group, Person, and Product. Furthermore, we present an application of NER...
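
As an illustration of the TF-IDF plus Logistic Regression setup mentioned above, the following minimal sketch shows how such a text-classification pipeline can be assembled with scikit-learn. The example texts and category labels are toy placeholders, not the actual DUTA data or class names.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy stand-in for labeled Tor-domain texts (the real data would be DUTA).
texts = ["bulk pills shipped worldwide", "mdma and ecstasy listings",
         "fake passports and id cards", "counterfeit banknotes for sale"]
labels = ["drugs", "drugs", "counterfeiting", "counterfeiting"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),               # word-level TF-IDF representation
    ("lr", LogisticRegression(max_iter=1000)),  # logistic-regression classifier
])
clf.fit(texts, labels)

# Macro F1 averages the per-class F1 scores, as in the 93.7% figure reported above
# (here computed on the toy training data only, so it is not meaningful).
print("macro F1:", f1_score(labels, clf.predict(texts), average="macro"))
```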
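
The Active Learning idea of labeling only the most informative samples can be sketched with pool-based uncertainty sampling, shown below. This is a generic illustration under that assumption, not necessarily the exact query strategy or cascade design used for Pastebin; the texts and labels are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["stolen card dumps for sale", "my holiday photo album"]
labeled_y = [1, 0]                    # toy labels: 1 = suspicious, 0 = benign
pool_texts = ["fresh cvv available now", "football match highlights",
              "account credentials leaked", "recipe for chocolate cake"]

vec = TfidfVectorizer().fit(labeled_texts + pool_texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(labeled_texts), labeled_y)

# Uncertainty sampling: query the pool sample whose predicted probability is
# closest to 0.5, i.e. the one the current model is least certain about,
# and send it to a human annotator before retraining.
proba = clf.predict_proba(vec.transform(pool_texts))[:, 1]
query_idx = int(np.argmin(np.abs(proba - 0.5)))
print("most informative sample to label next:", pool_texts[query_idx])
```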
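
Finally, the construction of the Products Correlations Graph and its k-shell decomposition can be sketched with NetworkX as follows. The market-to-product listings are illustrative placeholders, and selecting the maximal shell is only one simple reading of "analyzing the connectivity of the products in the core shell".

```python
from itertools import combinations
import networkx as nx

# Hypothetical mapping: marketplace -> products offered there.
markets = {
    "market_a": ["mdma", "ecstasy", "cannabis"],
    "market_b": ["mdma", "ecstasy", "lsd"],
    "market_c": ["cannabis", "lsd", "mdma"],
}

G = nx.Graph()
for products in markets.values():
    for p, q in combinations(set(products), 2):
        # An edge means the two products are offered simultaneously in a market;
        # the weight counts in how many markets the pair co-occurs.
        if G.has_edge(p, q):
            G[p][q]["weight"] += 1
        else:
            G.add_edge(p, q, weight=1)

# k-shell decomposition: products in the maximal (core) shell are the most
# densely interconnected and thus candidates for emerging products.
core_numbers = nx.core_number(G)
k_max = max(core_numbers.values())
core_shell = [n for n, k in core_numbers.items() if k == k_max]
print("core-shell products:", sorted(core_shell))
```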