This paper is on a graph clustering scheme inspired by ensemble learning. In short, the idea of ensemble learning is to learn several weak classifiers and use these weak classifiers to determine a strong classifier. In this contribution, we use the generic procedure of ensemble learning and determine several weak graph clusterings (with respect to the objective function). From the partition given by the maximal overlap of these clusterings (the cluster cores), we continue the search for a strong clustering. We demonstrate the performance of this scheme by using it to maximize the modularity of a graph clustering. We show, that the quality of the initial weak clusterings is of minor importance for the quality of the final result of the scheme if we iterate the process of restarting from maximal overlaps.
Despite growing speculation about the role of human behavior in cyber-security of machines, concrete data-driven analysis and evidence have been lacking. Using Symantec’s WINE platform, we conduct a detailed study of 1.6 million machines over an 8-month period in order to learn the relationship between user behavior and cyber attacks against their personal computers. We classify users into 4 categories (gamers, professionals, software developers, and others, plus a fifth category comprising everyone) and identify a total of 7 features that act as proxies for human behavior. For each of the 35 possible combinations (5 categories times 7 features), we studied the relationship between each of these seven features and one dependent variable, namely the number of attempted malware attacks detected by Symantec on the machine. Our results show that there is a strong relationship between several features and the number of attempted malware attacks. Had these hosts not been protected by Symantec’s anti-virus product or a similar product, they would likely have been infected. Surprisingly, our results show that software developers are more at risk of engaging in risky cyber-behavior than other categories.
Partitioning large networks into smaller subnetworks (communities) is an important tool to analyze the structure of complex linked systems. In recent years, many inmemory community detection algorithms have been proposed for graphs with millions of edges. Analyzing massive graphs with billions of edges is impossible for existing algorithms. In this contribution, we show how to find community partitions of networks with billions of edges. Our approach is based on an ensemble learning scheme for community detection that provides a way to identify high quality partitions from an ensemble of partitions with lower quality. We present a pre-processing procedure for community detection algorithms that significantly decreases the problem size. After reducing the problem size, traditional non-distributed community detection algorithms can be applied. We implemented a weak but highly scalable label propagation algorithm on top of the distributed-computing framework Apache Hadoop. The evaluation of our implementation on a 50-node Hadoop cluster and with evaluation datasets up to 3.3 billion edges shows very good results with respect to clustering quality as well as scalability. For a smaller 260 million edge network, we show that our preprocessing can improve the results of the popular Louvain modularity clustering algorithm.
The modularity function is a widely used measure for the quality of a graph clustering. Finding a clustering with maximal modularity is NP-hard. Thus, only heuristic algorithms are capable of processing large datasets. Extensive literature on such heuristics has been published in the recent years. We present a fast randomized greedy algorithm which uses solely local information on gradients of the objective function. Furthermore, we present an approach which first identifies the 'cores' of clusters before calculating the final clustering. The global heuristic of identifying core groups solves problems associated with pure local approaches. With the presented algorithms we were able to calculate for many realworld datasets a clustering with a higher modularity than any algorithm before.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.