Proceedings of the 8th Workshop on General Purpose Processing Using GPUs 2015
DOI: 10.1145/2716282.2716289
Stochastic gradient descent on GPUs

Abstract: Irregular algorithms such as Stochastic Gradient Descent (SGD) can benefit from the massive parallelism available on GPUs. However, unlike in data-parallel algorithms, synchronization patterns in SGD are quite complex. Furthermore, scheduling for scale-free graphs is challenging. This work examines several synchronization strategies for SGD, ranging from simple locking to conflict-free scheduling. We observe that static schedules do not yield better performance despite eliminating the need to perform conflict …
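The abstract contrasts simple locking with conflict-free (static) scheduling for parallel SGD updates on irregular data. The sketch below is our own minimal Python illustration of the locking strategy in a matrix-factorization setting; the names (users, items, K, LR, REG) and the per-row/per-column locks are assumptions for illustration, not the paper's GPU implementation, which would express the same conflicts with atomics or a precomputed conflict-free schedule.

# Illustrative sketch only (not the paper's implementation): edge-level SGD
# for matrix factorization, where each rating (u, i, r) updates one user
# row and one item row. Conflicting updates are serialized with locks,
# mirroring the "simple locking" strategy mentioned in the abstract.
import threading
import numpy as np

K, LR, REG = 16, 0.01, 0.05                      # assumed hyperparameters
users = np.random.rand(1000, K)                  # user factor matrix
items = np.random.rand(500, K)                   # item factor matrix
user_locks = [threading.Lock() for _ in range(len(users))]
item_locks = [threading.Lock() for _ in range(len(items))]

def sgd_update(u, i, r):
    """One SGD step on a single rating; locks are always taken in the
    same order (user, then item), so concurrent workers cannot deadlock."""
    with user_locks[u], item_locks[i]:
        err = r - users[u] @ items[i]
        pu, qi = users[u].copy(), items[i].copy()
        users[u] += LR * (err * qi - REG * pu)
        items[i] += LR * (err * pu - REG * qi)

A conflict-free (static) schedule removes the locks entirely by only running updates that touch disjoint rows and columns in the same round; the DSGD-style stratification sketched further below is one such schedule.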

Cited by 32 publications (15 citation statements)
References 9 publications
“…There are several existing works that have compared GPU performance with CPUs. For example, it is reported that a GPU implementation of Stochastic Gradient Descent (SGD) performs as well as 14 cores on a 40-core CPU system [20]. For the Single-Source Shortest Path (SSSP) problem, it is reported that an efficient serial implementation can outperform highly parallel GPU implementations for high-diameter or scale-free graphs [21].…”
Section: Methods
confidence: 99%
“…DSGD (Distributed SGD) partitions the ratings matrix into several blocks and updates a set of independent blocks concurrently [8]. Kaleem et al. show that parallel SGD can run efficiently on a GPU, and their GPU implementation is comparable to a 14-thread CPU implementation [51]. Jinoh et al. propose MLGF-MF, which is robust to skewed matrices and runs efficiently on block-storage devices (e.g., SSD disks) as well as shared-memory platforms.…”
Section: Related Work
confidence: 99%
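The DSGD idea quoted above (partition the ratings matrix into blocks and update mutually independent blocks concurrently) can be pictured with the toy sketch below. This is our own illustration under assumed names (R, U, V, P, K, LR, REG), not the code of [8] or [51]: the blocks grouped into one "stratum" share no rows or columns, so they could be dispatched to different workers or GPU thread blocks without synchronization.

# Toy sketch of DSGD-style stratification (our illustration, not the cited
# papers' code). The ratings matrix is split into a P x P grid of blocks;
# the P blocks processed together in one stratum share no rows or columns,
# so their SGD updates never conflict and can run in parallel.
import numpy as np

P, K, LR, REG = 4, 16, 0.01, 0.05          # assumed partition count and hyperparameters
R = np.random.rand(8, 8)                   # dense toy ratings matrix
U = np.random.rand(R.shape[0], K)          # user factors
V = np.random.rand(R.shape[1], K)          # item factors
row_blocks = np.array_split(np.arange(R.shape[0]), P)
col_blocks = np.array_split(np.arange(R.shape[1]), P)

for shift in range(P):                     # one stratum per value of shift
    for b in range(P):                     # these P blocks are mutually independent
        rb, cb = row_blocks[b], col_blocks[(b + shift) % P]
        for u in rb:
            for i in cb:
                err = R[u, i] - U[u] @ V[i]
                pu, qi = U[u].copy(), V[i].copy()
                U[u] += LR * (err * qi - REG * pu)
                V[i] += LR * (err * pu - REG * qi)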
“…Unlike the Sigmoid and tanh functions, in which the gradient may vanish, PENLU does not exhibit this phenomenon because it has no right-saturation property and its derivative does not approach 0. Using the back-propagation SGD algorithm (Kaleem et al., 2015), parameters such as β and α are optimized so that the unit can switch freely between an exponential unit and a rectifier unit, making both linear and nonlinear adjustment possible. This design makes PENLU more flexible than ReLU, PReLU and ELU, which can be regarded as special cases of PENLU.…”
Section: Parametric Exponential Nonlinear Unit
confidence: 99%
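The quoted description of PENLU (learnable α and β, no right saturation, ReLU/PReLU/ELU as special cases) can be pictured with the schematic activation below. This is explicitly not the exact PENLU formula from the cited paper, only an assumed ELU-like parameterization showing how α and β let the negative branch interpolate between a rectifier (α = 0) and an exponential unit (α = 1, β = 1) while the positive branch stays linear and never saturates.

# Schematic only: an ELU-family activation with learnable alpha and beta.
# This is NOT the exact PENLU definition from the cited paper; it merely
# illustrates the behavior described in the quote: identity for x >= 0 (no
# right saturation), exponential branch for x < 0, with ReLU recovered at
# alpha = 0 and ELU at alpha = 1, beta = 1.
import numpy as np

def penlu_like(x, alpha=1.0, beta=1.0):
    x = np.asarray(x, dtype=float)
    neg = alpha * (np.exp(x / beta) - 1.0)    # exponential branch for x < 0
    return np.where(x >= 0.0, x, neg)         # linear branch for x >= 0

def penlu_like_grad(x, alpha=1.0, beta=1.0):
    x = np.asarray(x, dtype=float)
    dneg = (alpha / beta) * np.exp(x / beta)  # derivative of the exponential branch
    return np.where(x >= 0.0, 1.0, dneg)      # derivative is 1 for x >= 0, never 0 there

In practice α and β would be treated as trainable parameters and updated by the same SGD pass as the network weights, which is the point the quoted passage makes about optimizing them with back-propagation.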