2017 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2017.8257942

Understanding and optimizing the performance of distributed machine learning applications on Apache Spark

Abstract: In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to an equivalent reference implementation using the high-performance computing framework MPI. We identify critical bottlenecks of the Spark framework and carefully study their implications on the performance of the algorithm. In order to improve Spark performance we then propo…
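To make the Spark-versus-MPI comparison concrete, the following is a minimal sketch, not the paper's implementation, of the communication pattern such an analysis typically exercises: summing per-partition partial gradients. The dataset, weight vector, and gradient function are toy assumptions for illustration.

import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[4]", "gradient-aggregation-sketch")

# Toy dataset: (features, label) pairs spread over four partitions.
data = sc.parallelize(
    [(np.random.rand(10), float(i % 2)) for i in range(1000)], numSlices=4
)

w = np.zeros(10)  # current model weights, shipped to executors via closure capture

def seq_op(acc, point):
    # Accumulate a logistic-regression-style gradient for one data point.
    x, y = point
    pred = 1.0 / (1.0 + np.exp(-w.dot(x)))
    return acc + (pred - y) * x

# treeAggregate combines partial results hierarchically, reducing the
# driver-side communication that a flat reduce() would incur.
grad = data.treeAggregate(np.zeros(10), seq_op, lambda a, b: a + b, depth=2)
print(grad)

How such reductions are scheduled and executed, compared with an explicit MPI collective, is the kind of framework behavior the paper's bottleneck analysis examines.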


Cited by 8 publications (5 citation statements)
References 4 publications
“…Several research efforts for distributed learning of neural networks have been conducted based on Apache Spark. Dunner et al. [33] proposed practical techniques to achieve the best performance on Apache Spark, targeting arbitrary distributed algorithms and infrastructures. Zhao et al. [34] proposed a scalable stochastic optimization method on Apache Spark that achieves both computation and communication efficiency.…”
Section: Related Work
confidence: 99%
“…On the contrary, frameworks like MPI (Message Passing Interface) would likely need more complex hand-coding and fine-tuning (e.g., manually setting chunk sizes, worker tasks, and their synchronization), and require programming skills that are often well beyond the competences of most computational scientists and researchers (Dunner et al., 2017).…”
Section: Multiple Node Experimentation
confidence: 99%
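As a hedged illustration of the manual effort this statement alludes to, and not code from the cited paper, the mpi4py sketch below chunks a toy dataset by hand and synchronizes ranks explicitly with a blocking Allreduce; the dataset shape and reduction are assumptions chosen for brevity.

# Run with e.g.: mpiexec -n 4 python mpi_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n, d = 1000, 10  # toy dataset size and feature dimension

# Manual chunking: each rank works out which slice of the data it owns.
chunk = n // size
lo = rank * chunk
hi = n if rank == size - 1 else lo + chunk

rng = np.random.default_rng(seed=rank)
X = rng.random((hi - lo, d))   # this rank's share of the features
local_sum = X.sum(axis=0)      # local partial result

# Manual synchronization: every rank blocks here until the element-wise
# sum of all partial results is available on every rank.
global_sum = np.empty(d)
comm.Allreduce(local_sum, global_sum, op=MPI.SUM)

if rank == 0:
    print(global_sum)

Each of these decisions (chunk boundaries, buffer allocation, choice of collective) is handled automatically by Spark, which is the trade-off the quoted passage describes.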
“…Apache Spark introduces a number of additional layers into the software stack, and thus a number of associated overheads. For this reason, we typically observe that Spark-based deployments of Snap ML are slower than those using MPI [6].…”
Section: Software Architecture
confidence: 99%
“…It provides distributed training of a variety of machine learning models and offers easy-to-use APIs in Java, Scala, and Python. It does not natively support GPU acceleration, and while it can leverage underlying native libraries such as BLAS, it tends to exhibit slower performance than the same distributed algorithms implemented natively in C++ using high-performance computing frameworks such as MPI [6].…”
Section: Introduction
confidence: 99%
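For contrast with the MPI sketch above, here is a brief example of the easy-to-use API this statement refers to: Spark MLlib's DataFrame-based interface trains a distributed model in a few lines. The tiny in-memory dataset is an assumption for illustration; real workloads would load a distributed DataFrame.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data: (features, label) rows.
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.0, 1.3]), 1.0)],
    ["features", "label"],
)

# Distributed training is a single fit() call; Spark handles partitioning,
# task scheduling, and fault tolerance behind the scenes.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients)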