dislib: Large Scale High Performance Machine Learning in Python

Cid-Fuentes, J. Álvarez; Solà, S.; Alvarez, Pol; Castro-Ginard, A.; Badía, Rosa M.

doi:10.1109/escience.2019.00018

Cited by 19 publications

(14 citation statements)

References 24 publications


“…Extensions of task-based programming to distributed programming, such as PyCOMPSs [23], [24], Dask [25], Ray [26], Parsl [27], and Pygion [28] are gaining popularity for scientific data analysis for the mix of performance and simplicity they offer. They provide a Python interface and often the transparent parallelization of some classical APIs (or part of them) like Numpy or Pandas.…”

Section: Related Work (mentioning)

confidence: 99%
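The task-based style described in the statement above can be illustrated with a minimal sketch. It uses only Python's standard `concurrent.futures` module as a stand-in and is not the API of PyCOMPSs, Dask, Ray, Parsl, or Pygion; the function names `partial_sum_of_squares`, `split`, and `parallel_sum_of_squares` are hypothetical. Threads in CPython do not speed up CPU-bound work, so the sketch shows the programming model only: independent tasks are submitted, and collecting their results is the synchronization point.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    """One task: reduce a chunk of numbers to a single partial result."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Split data into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_chunks=4):
    # Submit one task per chunk; the executor decides when each one runs.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        futures = [pool.submit(partial_sum_of_squares, chunk)
                   for chunk in split(data, n_chunks)]
        # Waiting on the futures is where the partial results are combined.
        return sum(future.result() for future in futures)


total = parallel_sum_of_squares(list(range(10)))
```

The calling code reads like a sequential program, which is the simplicity the statement refers to; a distributed runtime would place the same tasks on different nodes instead of threads.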

DEISA: Dask-Enabled In Situ Analytics

Gueroudji, Raffin. 2021

2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)


A widening performance gap is separating CPU performance and IO bandwidth on large scale systems. In some fields such as weather forecasting and nuclear fusion, numerical models generate such amounts of data that classical post hoc processing is no longer feasible due to the limits in both storage capacity and IO performance. In situ approaches are attractive to bypass disk accesses in these cases and fully leverage the HPC platform. They are however often complex to set up and can require redeveloping parallel versions of the analysis from scratch. In this paper we propose a hybrid model that is well suited for in situ workflows that combine regular simulations and irregular analytics. Our model couples the bulk synchronous parallel paradigm for simulation with a distributed task-based one for analysis. This reduces complexity and leverages the best of each of these two powerful paradigms. We validate the model with a prototype, called DEISA, that supports coupling MPI parallel codes with analyses written using Dask. This implementation requires minimal modifications of both the simulation and analysis codes compared to their post hoc counterpart. It gives access to an already existing rich ecosystem to be used in situ, such as the parallel versions of Numpy, Pandas and scikit-learn. Experiments in configurations up to 1024 cores show that DEISA can improve the simulation wallclock time (excluding analysis) by a factor of up to 3 and the total experiment (including analysis) core-hour cost by a factor of up to 5 compared to parallel post hoc with plain Dask, while requiring the modification of only two lines of Python code, three of YAML, and none at all in a C simulation code already instrumented with PDI Data Interface.



“…Dislib [13] is a distributed machine learning library built on top of PyCOMPSs programming model. In essence, dislib is a collection of PyCOMPSs applications exposed through two main APIs: an estimator-based interface and a data handling interface.…”

Section: Dislib (mentioning)

confidence: 99%

“…However, as scientific data sets grow in size, it appears a need for distributed machine learning libraries that can run in traditional computational science platforms like HPC clusters. Towards this, some machine learning libraries, like MLlib [11], Dask-ML [12], dislib [13], and TensorFlow [14] have addressed scikit-learn's limitations by being able to run in multiple computers. Among these libraries, dislib is one of the better suited for HPC clusters, as it provides better performance and scalability than other similar libraries when processing large data sets in these environments [13].…”

Section: Introduction (mentioning)

confidence: 99%

“…Towards this, some machine learning libraries, like MLlib [11], Dask-ML [12], dislib [13], and TensorFlow [14] have addressed scikit-learn's limitations by being able to run in multiple computers. Among these libraries, dislib is one of the better suited for HPC clusters, as it provides better performance and scalability than other similar libraries when processing large data sets in these environments [13]. Dislib is built on top of PyCOMPSs programming model [15,16], and exposes two main APIs: a data handling interface to manage large scale data as if it was stored locally, and an estimator-based interface that provides the various machine learning models included in the library in an easy-to-use manner.…”

Section: Introduction (mentioning)

confidence: 99%
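The two interfaces named in the statement above, a data handling interface and an estimator-based interface, can be pictured with a small toy. This is a sketch under stated assumptions and not dislib's actual API: `BlockedRows` and `ColumnMeanScaler` are hypothetical names, the data lives in one process, and only the `fit`/`transform` convention popularized by scikit-learn is borrowed.

```python
class BlockedRows:
    """Hypothetical stand-in for a distributed array: rows stored in blocks."""

    def __init__(self, rows, block_size):
        self.block_size = block_size
        self.blocks = [rows[i:i + block_size]
                       for i in range(0, len(rows), block_size)]

    @property
    def shape(self):
        n_rows = sum(len(block) for block in self.blocks)
        n_cols = len(self.blocks[0][0]) if n_rows else 0
        return (n_rows, n_cols)


class ColumnMeanScaler:
    """Toy estimator: fit() learns column means, transform() subtracts them."""

    def fit(self, x):
        n_rows, n_cols = x.shape
        totals = [0.0] * n_cols
        # Each block contributes partial sums; in a distributed setting
        # every block could be handled by a separate task.
        for block in x.blocks:
            for row in block:
                for j, value in enumerate(row):
                    totals[j] += value
        self.means_ = [total / n_rows for total in totals]
        return self

    def transform(self, x):
        centered = [[value - mean for value, mean in zip(row, self.means_)]
                    for block in x.blocks for row in block]
        # Keep the same blocking as the input.
        return BlockedRows(centered, x.block_size)


data = BlockedRows([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], block_size=2)
scaler = ColumnMeanScaler().fit(data)
centered = scaler.transform(data)
```

The caller only sees `shape`, `fit`, and `transform`; how the rows are split into blocks stays behind the data structure, which is the sense in which large data can be handled "as if it was stored locally".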


ds-array: A Distributed Data Structure for Large Scale Machine Learning

Cid-Fuentes, Álvarez, Solà et al. 2021

Preprint

Self Cite


Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Often, data sets from these research areas need to be processed in distributed platforms due to their magnitude. This can be done using one of the various distributed machine learning libraries available. One of these libraries is dislib, a distributed machine learning library for Python especially designed to process large scale data sets on HPC clusters, which makes dislib an ideal candidate for analyzing scientific data. However, dislib's main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislib's main limitations in data management. Ds-arrays simplify distributed data management in dislib by exposing a NumPy-like API, provide more flexibility, and reduce the computational complexity of some operations. This results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.


“…Another area of application of the new features presented in this paper has been the dislib library [4], a distributed computing machine learning library parallelized with PyCOMPSs. Some machine learning algorithms are iterative, where convergence is checked at every iteration step to decide whether the next iteration is necessary.…”

Section: Machine Learning Algorithms (mentioning)

confidence: 99%
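The iterative pattern mentioned in the statement above, where a convergence check after each step decides whether another iteration runs, can be sketched as follows. This is a generic single-process illustration and not code from dislib or PyCOMPSs; the function name `fit_mean_by_gradient_descent` and its parameters are hypothetical.

```python
def fit_mean_by_gradient_descent(values, learning_rate=0.5, tol=1e-9,
                                 max_iter=1000):
    """Minimize half the mean squared distance to `values`.

    Returns the estimate and the number of iterations performed.
    """
    estimate = 0.0
    for iteration in range(1, max_iter + 1):
        # Gradient of 0.5 * mean((estimate - v) ** 2) with respect to estimate.
        gradient = sum(estimate - v for v in values) / len(values)
        new_estimate = estimate - learning_rate * gradient
        # Convergence check: stop once the update is smaller than tol.
        converged = abs(new_estimate - estimate) < tol
        estimate = new_estimate
        if converged:
            return estimate, iteration
    return estimate, max_iter


estimate, iterations = fit_mean_by_gradient_descent([1.0, 2.0, 3.0])
```

In a task-based runtime the check needs the result of the current step before the next one can be scheduled, so each iteration becomes a synchronization point.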

Managing Failures in Task-Based Parallel Workflows in Distributed Computing Environments

Ejarque, Trapero-Bertran, Cid-Fuentes et al. 2020

Lecture Notes in Computer Science

Self Cite


Current scientific workflows are large and complex. They normally perform thousands of simulations whose results, combined with searching and data analytics algorithms in order to infer new knowledge, generate a very large amount of data. To this end, workflows comprise many tasks and some of them may fail. Most of the work done on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation in other resources, etc.). However, some of these failures can be caused by the application itself (corrupted data, algorithms which are not converging for certain conditions, etc.), and these fault tolerance mechanisms are not sufficient to perform a successful workflow execution. In these cases, developers have to add some code in their applications to prevent and manage the possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use-cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.

