SciSpark: Applying in-memory distributed computing to weather event detection and tracking

Palamuttam, Rahul; Mogrovejo, Renato Marroquin; Mattmann, Chris A.; Wilson, Brian; Whitehall, Kim; Verma, Rishi; McGibbney, L. J.; Ramírez, Paul

doi:10.1109/bigdata.2015.7363983

Cited by 36 publications

(22 citation statements)

References 5 publications


“…In addition to the research highlights we presented in the previous sections, there are other research works which have been done using Apache Spark as a core engine for solving data problems in machine learning and data mining [5,36], graph processing [16], genomic analysis [60,65], time series data [71], smart grid data [73], spatial data processing [87], scientific computations of satellite data [67], large-scale biological sequence alignment [97] and data discretization [68]. There are also some recent works on using Apache Spark for deep learning [46,64].…”

Section: Related Research (mentioning)

confidence: 99%

Big data analytics on Apache Spark

Salloum

Dautov

Chen

et al. 2016

Int J Data Sci Anal


Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.


“…Persistence Ordering: The second problem deals with ensuring that the execution order of dependent HTM transactions is correctly reflected in PM following crash recovery. As an example, consider the dependent transactions A, B, C in Listings 1, 2 & 3. The HTM will serialize their execution in some order: say A, B and C. The values of the transaction variables following the execution of A are given by the vector V1 = [w, x, y, z] = [1, 1, 0, 0]; after the execution of B the vector becomes V2 = [2, 1, 2, 0] and finally following C it is 1,2,3]. Under normal operation the write backs of variables to PM from different transactions may become arbitrarily interleaved.…”

Section: Challenges Of Persistent HTM Transactions (mentioning)

confidence: 99%

Hardware transactional persistent memory

Giles

Doshi

Varman

2018

Proceedings of the International Symposium on Memory Systems


Emerging Persistent Memory technologies (also PM, Non-Volatile DIMMs, Storage Class Memory or SCM) hold tremendous promise for accelerating popular data-management applications like in-memory databases. However, programmers now need to deal with ensuring the atomicity of transactions on Persistent Memory resident data and maintaining consistency between the order in which processors perform stores and that in which the updated values become durable. The problem is specially challenging when high-performance isolation mechanisms like Hardware Transactional Memory (HTM) are used for concurrency control. This work shows how HTM transactions can be ordered correctly and atomically into PM by the use of a novel software protocol combined with a Persistent Memory Controller, without requiring changes to processor cache hardware or HTM protocols. In contrast, previous approaches require significant changes to existing processor microarchitectures. Our approach, evaluated using both micro-benchmarks and the STAMP suite, compares well with standard (volatile) HTM transactions. It also yields significant gains in throughput and latency in comparison with persistent transactional locking.


“…In parallel, current bibliography trends highlight the adoption of Spark for both key processing steps in large-scale imaging problems [10,33,11,34,35], as well as for parallelizing dedicated machine learning and optimization algorithms [36,37,38,39]. Specifically, with regard to imaging data management over Spark, SciSpark [10,33] pre-processes structured scientific data in network Common Format (netCDF) and Hierarchical Data Format (HDF). The result is a distributed computing array structure suitable for supporting iterative scientific algorithms for multidimensional data, with applications on Earth Observation and climate data for weather event detection.…”

Section: The Positioning Of Apache Spark In the Distributed Learning (mentioning)

confidence: 99%

A distributed learning architecture for big imaging problems in astrophysics

Panousopoulou

Farrens

Mastorakis

et al. 2017

2017 25th European Signal Processing Conference (EUSIPCO)


Future challenges in Big Imaging problems will require that traditional, "black-box" machine learning methods be revisited from the perspective of ongoing efforts in distributed computing. This paper proposes a distributed architecture for astrophysical imagery, which exploits the Apache Spark framework for the efficient parallelization of the learning problem at hand. The use case is related to the challenging problem of deconvolving a space variant point spread function from noisy galaxy images. We conduct benchmark studies considering relevant datasets and analyze the efficacy of the herein developed parallelization approaches. The experimental results report 58% improvement in time response terms against the conventional computing solutions, while useful insights into the computational trade-offs and the limitations of Spark are extracted.
