Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has gone into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
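To make the checkpoint-restart pattern under evaluation concrete, the following is a minimal sketch in PyTorch (one of the frameworks studied). The model, optimizer, and checkpoint path are hypothetical placeholders; the built-in mechanisms of Chainer, PyTorch, and TensorFlow differ in format and cost, which is precisely what the evaluation measures.

```python
import os
import torch
import torch.nn as nn

# Hypothetical model and optimizer; stand-ins for the real training job.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
ckpt_path = "checkpoint.pt"  # assumed location on a parallel file system

def save_checkpoint(epoch):
    # Serialize model weights, optimizer state, and progress in one file.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, ckpt_path)

def restore_checkpoint():
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if not os.path.exists(ckpt_path):
        return 0
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

start_epoch = restore_checkpoint()
for epoch in range(start_epoch, 10):
    # ... one epoch of training would go here ...
    save_checkpoint(epoch)
```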
Training deep neural networks requires huge amounts of data. The next generation of intelligent systems will generate and utilise massive amounts of data that will be transferred along machine learning workflows. We study the effect of reducing the precision of this data at early stages of the workflow (i.e., the input) on both the prediction accuracy and the learning behaviour of deep neural networks. We show that high precision data can be transformed to low precision before being fed to a neural network model with negligible loss in accuracy; as such, a high precision representation of input data is not entirely necessary for some applications. The findings of this study pave the way for the application of deep learning in areas where acquiring high precision data is difficult due to memory and computational power constraints. We further present a hurricane prediction case study in which we predict the monthly number of hurricanes over the Atlantic Ocean using deep neural networks. We train a deep neural network model that predicts the number of hurricanes, first using high precision input data and then using low precision data, and observe a drop in prediction accuracy of less than 2%.
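As an illustration of the kind of input precision reduction studied, the sketch below casts a high precision batch to float16 before feeding it to a network and compares the outputs. The toy data and model are hypothetical (the actual case study uses hurricane data); only the input is reduced, the model itself stays in its usual compute precision.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical high-precision input batch (e.g. climate features in float64).
x_high = np.random.rand(32, 16).astype(np.float64)

# Reduce the precision of the *input data only*; the model stays in float32.
x_low = x_high.astype(np.float16)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

# Cast back to the model's compute dtype at the boundary; the information
# lost in the float16 round trip is what the study finds to be negligible.
y_high = model(torch.from_numpy(x_high).float())
y_low = model(torch.from_numpy(x_low.astype(np.float32)))

print("max output difference:", (y_high - y_low).abs().max().item())
```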
Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and to alleviate memory capacity limitations when training large models and/or using high dimensional inputs. With the steady increase in dataset and model sizes, model/hybrid parallelism is expected to play an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the performance and scalability trade-offs of different parallelism approaches. We use our model-driven analysis as the basis for an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.
CCS Concepts: • Computing methodologies → Parallel computing methodologies; Distributed computing methodologies; Machine learning.
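As a rough illustration of the kind of model-driven reasoning behind such an oracle, the sketch below estimates per-GPU memory and all-reduce communication time for plain data parallelism. The formulas, constants, and example numbers are illustrative assumptions, not the paper's actual cost model.

```python
# Back-of-the-envelope cost model for data parallelism only; the paper's
# oracle covers more strategies and far more detailed terms.

def data_parallel_estimate(num_params, samples, batch_per_gpu, gpus,
                           bytes_per_value=4, bandwidth_gbps=100):
    """Rough per-iteration estimates; every constant here is an assumption."""
    # Each GPU holds a full replica: weights, gradients, and optimizer state.
    memory_per_gpu_bytes = 3 * num_params * bytes_per_value
    # A ring all-reduce moves roughly 2 * (gpus - 1) / gpus of the gradients.
    comm_bytes = 2 * (gpus - 1) / gpus * num_params * bytes_per_value
    comm_seconds = comm_bytes * 8 / (bandwidth_gbps * 1e9)
    iterations_per_epoch = samples / (batch_per_gpu * gpus)
    return memory_per_gpu_bytes, comm_seconds, iterations_per_epoch

mem, comm, iters = data_parallel_estimate(
    num_params=25_000_000, samples=1_281_167, batch_per_gpu=64, gpus=1024)
print(f"memory/GPU ~{mem / 1e9:.2f} GB, "
      f"all-reduce ~{comm * 1e3:.2f} ms/iter, "
      f"{iters:.0f} iterations/epoch")
```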
Machine learning applications now span multiple domains due to the increase in computational power of modern systems. There has been a recent surge of machine learning applications in High Performance Computing (HPC) in an attempt to speed up training. However, besides training, hyperparameter optimisation (HPO) is one of the most time consuming and resource intensive parts of a machine learning workflow. Numerous algorithms and tools exist to accelerate the process of finding the right parameters for a model. Most of these tools do not exploit the parallelism provided by modern systems and are serial or limited to a single node. The few that offer distributed execution require a significant amount of programming effort. There is, therefore, a need for a tool or scheme that can scale and leverage HPC infrastructures such as supercomputers, with minimal programmer effort and little or no performance overhead. We present an HPO scheme built on top of PyCOMPSs, a programming model and runtime which aims to ease the development of parallel applications for distributed infrastructures. We show that PyCOMPSs is a powerful framework that can accelerate the process of hyperparameter optimisation across multiple devices and computing units. We also show that PyCOMPSs provides easy programmability, seamless distribution, and scalability, key features missing in existing tools. Furthermore, we perform a detailed performance analysis of different configurations to demonstrate the effectiveness of our approach.
CCS Concepts: • Computing methodologies → Parallel computing methodologies; Machine learning.
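A minimal sketch of how hyperparameter trials might be distributed as PyCOMPSs tasks is shown below. The train_and_validate stub and the grid of values are hypothetical, and the paper's actual scheme may structure the search differently; the point is that annotating the evaluation function with @task lets the runtime schedule trials across nodes with almost no extra code.

```python
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

def train_and_validate(lr, batch_size):
    # Hypothetical stand-in for real training; returns a validation score.
    return 1.0 / (abs(lr - 0.01) + 1) + batch_size / 1000.0

@task(returns=1)
def evaluate(lr, batch_size):
    # Each call becomes a PyCOMPSs task that the runtime schedules
    # transparently on the available nodes of the cluster.
    return train_and_validate(lr, batch_size)

def grid_search(learning_rates, batch_sizes):
    # Launch all evaluations asynchronously; PyCOMPSs tracks the futures.
    futures = {(lr, bs): evaluate(lr, bs)
               for lr in learning_rates for bs in batch_sizes}
    # Synchronize only once, after all tasks have been submitted.
    scores = {cfg: compss_wait_on(f) for cfg, f in futures.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    best = grid_search([1e-3, 1e-2, 1e-1], [32, 64, 128])
    print("best configuration:", best)
```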