Processing multi-way spatial joins on map-reduce

Gupta, Himanshu; Chawda, Bhupesh; Negi, Sumit; Faruquie, Tanveer A.; Subramaniam, L. Venkata; Mohania, Mukesh

doi:10.1145/2452376.2452390

Cited by 35 publications

(23 citation statements)

References 21 publications

“…Unlike the naïve approaches discussed in [12], the cascaded pairwise spatial join in MSJS is efficient mainly because the disk I/O in Spark is much smaller than that in MapReduce. The series of pairwise spatial joins in MSJS do not perform as a number of map and reduce tasks but rather as a series of transactions in Spark that are executed in memory.…”

Section: Methods (mentioning)

confidence: 99%

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Zhao et al., 2017, IJGI

Multiway spatial join plays an important role in GIS (Geographic Information Systems) and their applications. With the increase in spatial data volumes, the performance of multiway spatial join has encountered a computation bottleneck in the context of big data. Parallel or distributed computing platforms, such as MapReduce and Spark, are promising for resolving the intensive computing issue. Previous approaches have focused on developing single-threaded join algorithms as an optimizing and partition strategy for parallel computing. In this paper, we present an effective high-performance multiway spatial join algorithm with Spark (MSJS) to overcome the multiway spatial join bottleneck. MSJS handles the problem through cascaded pairwise join. Using the power of Spark, the formerly inefficient cascaded pairwise spatial join is transformed into a high-performance approach. Experiments using massive real-world data sets prove that MSJS outperforms existing parallel approaches of multiway spatial join that have been described in the literature.
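The cascaded pairwise join named in the abstract above can be illustrated with a minimal in-memory sketch. It is not the MSJS implementation and does not use Spark; it assumes a chain query (each relation joined to the next on overlap), axis-aligned bounding boxes, and a plain nested-loop test. The names `MBR`, `overlaps`, and `cascaded_join` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed representation: (xmin, ymin, xmax, ymax) of an axis-aligned box.
MBR = Tuple[float, float, float, float]


def overlaps(a: MBR, b: MBR) -> bool:
    """Return True when two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cascaded_join(relations: List[Dict[str, MBR]]) -> List[Tuple[str, ...]]:
    """Evaluate the chain R1 overlaps R2 overlaps ... Rn one pairwise join at a time.

    Each pass extends the partial result tuples with matching objects from
    the next relation, so the output of one join feeds the next.
    """
    if not relations:
        return []
    partial: List[Tuple[str, ...]] = [(key,) for key in relations[0]]
    for i in range(1, len(relations)):
        prev, curr = relations[i - 1], relations[i]
        partial = [
            t + (key,)
            for t in partial
            for key, box in curr.items()
            if overlaps(prev[t[-1]], box)  # compare with the last joined object
        ]
    return partial


# Illustrative data only.
roads = {"r1": (0, 0, 2, 2), "r2": (5, 5, 6, 6)}
rivers = {"v1": (1, 1, 3, 3), "v2": (10, 10, 11, 11)}
parks = {"p1": (2.5, 2.5, 4, 4), "p2": (0, 0, 0.5, 0.5)}
result = cascaded_join([roads, rivers, parks])
```

In this sketch the intermediate result lives in a Python list; the citation statement attributes the efficiency of MSJS to keeping such intermediate results in memory rather than writing them to disk between MapReduce jobs.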

“…Gupta et al developed a Controlled-Replicate framework coupled with the project-split-replicate notation to handle multiway spatial join queries [12]. Controlled-Replicate runs as a cycle of two MapReduce jobs.…”

Section: Background and Related Work (mentioning)

confidence: 99%

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Zhao et al., 2017, IJGI

“…Since the practical method to efficiently query against big spatial data is to employ the divide and conquer strategy [9,26], most MapReduce-based PSQPAs use certain types of space filling curves, such as Hilbert space-filling curve, to map MBRs to grids based on the spatial correlation for optimizing efficiency [27,28]. We simply treat the number of grids p as one of the internal parameters of Spark-based PSQPAs.…”

Section: Identifying Factors Impacting the Efficiency Of Spark-based (mentioning)

confidence: 99%

“…The basic units composing this complexity are multiway joins. Although joins are unavoidable and time-consuming, the projection operation mapping spatial correlated datasets into the same grids is commonly used in the filter and refinement stages [28]. From the viewpoint of Spark CM, the number of grids determines the number of tasks that should be executed, which can directly impact the efficiency of the PSQPAs.…”

Section: Identifying Factors Impacting the Efficiency Of Spark-based (mentioning)

confidence: 99%

Elastic Spatial Query Processing in OpenStack Cloud Computing Environment for Time-Constraint Data Analysis

Huang, Zhang, et al., 2017, IJGI

Abstract: Geospatial big data analysis (GBDA) is extremely significant for time-constraint applications such as disaster response. However, the time-constraint analysis is not yet a trivial task in the cloud computing environment. Spatial query processing (SQP) is typical computation-intensive and indispensable for GBDA, and the spatial range query, join query, and the nearest neighbor query algorithms are not scalable without using MapReduce-liked frameworks. Parallel SQP algorithms (PSQPAs) are trapped in screw-processing, which is a known issue in Geoscience. To satisfy time-constrained GBDA, we propose an elastic SQP approach in this paper. First, Spark is used to implement PSQPAs. Second, Kubernetes-managed Core Operation System (CoreOS) clusters provide self-healing Docker containers for running Spark clusters in the cloud. Spark-based PSQPAs are submitted to Docker containers, where Spark master instances reside. Finally, the horizontal pod auto-scaler (HPA) would scale-out and scale-in Docker containers for supporting on-demand computing resources. Combined with an auto-scaling group of virtual instances, HPA helps to find each of the five nearest neighbors for 46,139,532 query objects from 834,158 spatial data objects in less than 300 s. The experiments conducted on an OpenStack cloud demonstrate that auto-scaling containers can satisfy time-constraint GBDA in clouds.
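The citation statements above describe mapping MBRs to grids so that spatially correlated objects land in the same partition, with the number of grids setting the number of tasks. A minimal sketch of that idea follows. It assumes a uniform grid with a hypothetical `cell_size` parameter and omits the Hilbert space-filling curve ordering mentioned in the quote; `cells_for` and `grid_join` are illustrative names, not functions from either cited paper.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed representation: (xmin, ymin, xmax, ymax) of an axis-aligned box.
MBR = Tuple[float, float, float, float]
Cell = Tuple[int, int]


def overlaps(a: MBR, b: MBR) -> bool:
    """Return True when two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cells_for(box: MBR, cell_size: float) -> List[Cell]:
    """List every grid cell that the box touches."""
    x0, x1 = int(box[0] // cell_size), int(box[2] // cell_size)
    y0, y1 = int(box[1] // cell_size), int(box[3] // cell_size)
    return [(cx, cy) for cx in range(x0, x1 + 1) for cy in range(y0, y1 + 1)]


def grid_join(left: Dict[str, MBR], right: Dict[str, MBR],
              cell_size: float) -> Set[Tuple[str, str]]:
    """Join two relations by comparing only objects that share a grid cell."""
    buckets: Dict[Cell, Tuple[List[str], List[str]]] = defaultdict(lambda: ([], []))
    for key, box in left.items():
        for cell in cells_for(box, cell_size):
            buckets[cell][0].append(key)
    for key, box in right.items():
        for cell in cells_for(box, cell_size):
            buckets[cell][1].append(key)

    pairs: Set[Tuple[str, str]] = set()  # a set drops pairs found in several cells
    for left_keys, right_keys in buckets.values():  # each cell is one unit of work
        for lk in left_keys:
            for rk in right_keys:
                if overlaps(left[lk], right[rk]):
                    pairs.add((lk, rk))
    return pairs


# Illustrative data only.
left = {"a": (0, 0, 3, 3), "b": (8, 8, 9, 9)}
right = {"x": (2, 2, 5, 5), "y": (20, 20, 21, 21)}
coarse = grid_join(left, right, cell_size=4.0)
fine = grid_join(left, right, cell_size=1.0)
```

A smaller `cell_size` yields more cells, and therefore more independent units of work, at the cost of replicating each box into more cells; the join result itself should not change.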

“…The loose Octree [30] allows for a degree of imprecision so that objects can be assigned to lower levels when they intersect only slightly with a cell. The idea of using grids to parallelize the join has also been optimized for GPUs [33] as well as on a larger scale on the MapReduce framework [12,23]. THERMAL-JOIN as presented here is single threaded but can be parallelized like the aforementioned approaches.…”

Section: Iterative Static Spatial Join (mentioning)

confidence: 99%

Thermal-Join

Tauheed, Heinis, Ailamaki, 2015

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Simulations have become ubiquitous in many domains of science. Today scientists study natural phenomena by first building massive three-dimensional spatial models and then by simulating the models at discrete intervals of time to mimic the behavior of natural phenomena. One frequently occurring challenge during simulations is the repeated computation of spatial self-joins of the model at each simulation time step. The join is performed to access a group of neighboring spatial objects (groups of particles, molecules or cosmological objects) so that scientists can calculate the cumulative effect (like gravitational force) on an object. Computing a self-join even in memory, soon becomes a performance bottleneck in simulation applications. The problem becomes even worse as scientists continue to improve the precision of simulations by increasing the number as well as the size (3D extent) of the objects. This leads to an exponential increase in join selectivity that challenges the performance and scalability of state-of-the-art approaches. We propose THERMAL-JOIN, a novel spatial self-join algorithm for dynamic memory-resident workloads. The algorithm groups objects in spatial proximity together into hot spots. Hot spots minimize the cost of computing join as objects assigned to a hot spot are guaranteed to overlap with each other. Using a nested spatial grid, THERMAL-JOIN partitions and indexes the dataset to locate hot spots. With experiments we show that our approach provides a speedup between 8 to 12× compared to the state of the art and also scales as scientists improve the precision of their simulations.
