Distributed stream clustering using micro-clusters on Apache Storm

Karunaratne, Pasan; Karunasekera, Shanika; Harwood, Aaron

doi:10.1016/j.jpdc.2016.06.004

Cited by 28 publications

(12 citation statements)

References 26 publications


“…In an IoT environment, data gathering and real-time data analysis are two prime concerns because of several data outsourcing (sensor) devices, which send small data (e.g., GPS coordinates) vs large data (e.g., surveillance videos) possibly at a very high speed. However, the current stream processing systems are not able to handle such a high-velocity data [158] and require explicit ingestion corresponding to an underlying system [159]. Hence, the existing systems in a geo-distributed IoT system cannot support multiple platforms and underlying databases.…”

Section: Concluding Remarks and Open Issues (mentioning)

confidence: 99%

A Survey on Geographically Distributed Big-Data Processing Using MapReduce

Dolev

Florissi

Gudes

et al. 2019

IEEE Trans. Big Data


Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems have a major drawback in terms of locally distributed computations, which prevent them in implementing geographically distributed data processing. The increasing amount of geographically distributed massive data is pushing industries and academia to rethink the current big-data processing systems. The novel frameworks, which will be beyond state-of-the-art architectures and technologies involved in the current system, are expected to process geographically distributed data at their locations without moving entire raw datasets to a single location. In this paper, we investigate and discuss challenges and requirements in designing geographically distributed data processing frameworks and protocols. We classify and study batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing geo-distributed frameworks, models, and algorithms with their overhead issues.


“…If scaling is linear, a smart city could start from a three-node cluster and scale when needed to thousands of nodes and get a proportional processing boost. To choose from the plethora of solutions which are potentially useful in a smart city environment and propose the architecture, we used the datasets described in Section 4 and the criteria described in Section 5 to evaluate: Two bulk data loading solutions: Apache Sqoop [50] vs. Oracle Loader for Hadoop [51]; Two streaming solutions: Spark Streaming [52] vs. Apache Storm [53]; Two NoSQL databases relevant for a smart city architecture: HBase [54] vs. Cassandra [55]; Two NoSQL databases using two SQL query engines: Apache Phoenix [56] vs. Presto [57]; Three Hive [58] execution engines: MapReduce vs. Tez vs. Spark [59].…”

Section: System Architecture and Components (mentioning)

confidence: 99%

“…When real-time processing with latencies in milliseconds is required, Apache Storm [53] or Spark Streaming can be used. These can be useful for processing data coming from sensors, and integrate well with a distributed message system such as Apache Kafka, that can work with hundreds of megabytes per second, from multiple clients.…”

Section: System Architecture and Components (mentioning)

confidence: 99%

Hadoop Oriented Smart Cities Architecture

Diaconița

Bologa

2018

Sensors


A smart city implies a consistent use of technology for the benefit of the community. As the city develops over time, components and subsystems such as smart grids, smart water management, smart traffic and transportation systems, smart waste management systems, smart security systems, or e-governance are added. These components ingest and generate a multitude of structured, semi-structured or unstructured data that may be processed using a variety of algorithms in batches, micro batches or in real-time. The ICT architecture must be able to handle the increased storage and processing needs. When vertical scaling is no longer a viable solution, Hadoop can offer efficient linear horizontal scaling, solving storage, processing, and data analyses problems in many ways. This enables architects and developers to choose a stack according to their needs and skill-levels. In this paper, we propose a Hadoop-based architectural stack that can provide the ICT backbone for efficiently managing a smart city. On the one hand, Hadoop, together with Spark and the plethora of NoSQL databases and accompanying Apache projects, is a mature ecosystem. This is one of the reasons why it is an attractive option for a Smart City architecture. On the other hand, it is also very dynamic; things can change very quickly, and many new frameworks, products and options continue to emerge as others decline. To construct an optimized, modern architecture, we discuss and compare various products and engines based on a process that takes into consideration how the products perform and scale, as well as the reusability of the code, innovations, features, and support and interest in online communities.


“…As this type of data is so big and various, it needs to be processed extremely fast and efficiently to allow final users to take advantage of it in real time. This leads us to the conclusion that traditional data processing methods which are applied to structured data will not fit unstructured spatial big data. In the subsequent section, a solution, also known as big data architecture, that was previously developed by Amini et al (2017) based on Apache Kafka for handling spatial big data in real time is presented.…”

Section: Spatial Big Data As the Future Of Road Transport (mentioning)

confidence: 99%

Proposal of big data route selection methods for autonomous vehicles

Reddig

Dikunow

Krzykowska

2018

Internet Technology Letters


The automotive industry is developing rapidly, mainly in the direction of automation. Automation means constructing self‐driving vehicles that will not need any human assistance. This would not only rise the comfort of travel but most of all it would allow to minimize congestions and reduce car accidents. In this paper, we propose a solution for self‐driving vehicles' development based on big data Kafka and Spark architectures that would enable to gather large quantities of diversified data in real‐time. This, in turn would allow the future cars to make fast, autonomous decisions concerning optimal route selection.
