Clustering algorithms have emerged as a powerful alternative meta-learning tool for accurately analyzing the massive volume of data generated by modern applications. Their main goal is to categorize data into clusters such that objects in the same cluster are similar according to specific metrics. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize clustering algorithms for a large number of applications. However, one of the major issues in using clustering algorithms for big data, and one that causes confusion among practitioners, is the lack of consensus on the definition of their properties as well as the lack of a formal categorization. To alleviate these problems, this paper introduces concepts and algorithms related to clustering, presents a concise survey of existing clustering algorithms, and compares them from both a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments in which we compared the most representative algorithm from each category using a large number of real (big) data sets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, as well as stability, runtime, and scalability tests. In addition, we highlighted the set of clustering algorithms that perform best for big data.
INDEX TERMS Clustering algorithms, unsupervised learning, big data.
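As an illustration of what an external validity metric measures, the sketch below computes the Rand index, which compares a clustering against ground-truth labels by counting the pairs of points on which the two labelings agree. The abstract does not name the specific metrics used, so the choice of the Rand index and the function name `rand_index` are assumptions made only for this example. Internal metrics, by contrast, score a clustering without any ground-truth labels.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Return the fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two points
    in the same cluster, or both put them in different clusters.
    Illustrative only: not necessarily a metric used in the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agreements = 0
    for i, j in pairs:
        same_in_truth = labels_true[i] == labels_true[j]
        same_in_pred = labels_pred[i] == labels_pred[j]
        if same_in_truth == same_in_pred:
            agreements += 1
    return agreements / len(pairs)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    # Same grouping with swapped cluster names: the score ignores label names.
    print(rand_index(truth, [1, 1, 0, 0]))
    # A grouping that splits both true clusters.
    print(rand_index(truth, [0, 1, 0, 1]))
```

Because the score depends only on which points are grouped together, renaming the clusters leaves it unchanged, which is why it can compare the output of different algorithms against a single labeled reference.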
Although the Internet of Things (IoT) can increase efficiency and productivity through intelligent and remote management, it also increases the risk of cyber-attacks. The potential threats to IoT applications and the need to reduce risk have recently become an interesting research topic. It is crucial that effective Intrusion Detection Systems (IDSs) tailored to IoT applications be developed. Such IDSs require an updated and representative IoT dataset for training and evaluation. However, there is a lack of benchmark IoT and IIoT datasets for assessing IDS-enabled IoT systems. This paper addresses this issue and proposes a new data-driven IoT/IIoT dataset with ground truth that incorporates a label feature indicating normal and attack classes, as well as a type feature indicating the sub-classes of attacks targeting IoT/IIoT applications for multi-classification problems. The proposed dataset, named TON_IoT, includes Telemetry data of IoT/IIoT services, as well as Operating Systems logs and Network traffic of the IoT network, collected from a realistic representation of a medium-scale network at the Cyber Range and IoT Labs at UNSW Canberra (Australia). This paper also describes the Telemetry data of IoT/IIoT services in the proposed dataset and their characteristics. TON_IoT has advantages that are currently lacking in state-of-the-art datasets: i) it has various normal and attack events for different IoT/IIoT services, and ii) it includes heterogeneous data sources. We evaluated the performance of several popular Machine Learning (ML) methods and a Deep Learning model on both binary and multi-class classification problems for intrusion detection using the proposed Telemetry dataset.