In this report, I discuss the history and current state of GPU HPC systems. Although high-powered GPUs have existed for only a short time, they have seen rapid adoption in deep learning applications. I also discuss an implementation of a commodity-hardware NVIDIA GPU HPC cluster for deep learning research and academic teaching.
HPC and GPU HPC History

High performance computing (HPC) is typically characterized by large amounts of memory and processing power. HPC, sometimes also called supercomputing, has been around since the 1960s with the introduction of the CDC STAR-100, and continues to push the limits of computing power and capabilities for large-scale problems [1,2]. However, the use of graphics processing units (GPUs) in HPC supercomputers only began in the mid-to-late 2000s [3,4]. Although graphics processing chips have been around since the 1970s, GPUs were not widely used for computation until the 2000s. During the early 2000s, GPU clusters began to appear for HPC applications. Most of these clusters were designed to run large calculations requiring vast computing power, and many clusters are still designed for that purpose [5].

GPUs have been increasingly used for computation due to their commodification, which has followed Moore's Law (demonstrated in Figure 1), and their use in specific applications such as neural networks. Although server-grade GPUs can be used in clusters, commodity-grade GPUs are much more cost-effective: a similar amount of computing power can be obtained with commodity hardware for roughly a third of the cost of server-grade hardware. In 2018, NVIDIA abruptly required businesses to replace commodity GPUs with its server-grade GPUs, a move that appeared primarily motivated by a desire to increase earnings but may also have been related to warranty issues [6]. However, commodity hardware still proves useful for GPU clusters [7,8,9], especially in academic settings where the NVIDIA EULA does not appear to apply. Several studies have examined the performance of commodity [10,11] and non-commodity [12] GPU hardware for various calculations, and have generally found commodity hardware suitable for use in GPU clusters.
Although the legal definitions in NVIDIA's EULA are intentionally vague, using commodity NVIDIA GPUs with the associated NVIDIA drivers and software appears to be permitted for smaller academic deployments such as our use case [13].

Although some guidelines exist for GPU clusters [15], and OpenHPC provides "recipes" with instructions for installing SLURM on a CentOS or SUSE cluster, there is no good step-by-step documentation for creating a commodity GPU cluster from scratch using Ubuntu Linux. Ubuntu is currently one of the most widely used Linux distributions for both personal and server use, and it has a vibrant community as well as commercial support, making it a good choice of Linux distribution. One drawback of Ubuntu is that it is frequently updated and may not be as stable as other Linux operating systems such as