The demand for artificial intelligence has grown significantly over the last decade, fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in the computation power of single machines, there is a need to distribute the machine learning workload across multiple machines, turning a centralized system into a distributed one. These distributed systems present new challenges, first and foremost the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state of the art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available.

… 200x over conventional CPUs for an image recognition algorithm using a pretrained multilayer perceptron (MLP). An alternative to generic GPUs for acceleration is the use of Application-Specific Integrated Circuits (ASICs), which implement specialized functions through a highly optimized design. In recent times, the demand for such chips has risen significantly [100]. When applied to, e.g., Bitcoin mining, ASICs have a significant competitive advantage over GPUs and CPUs due to their high performance and power efficiency [145]. Since matrix multiplications play a prominent role in many machine learning algorithms, these workloads are highly amenable to acceleration through ASICs.

Google applied this concept in their Tensor Processing Unit (TPU) [129], which, as the name suggests, is an ASIC that specializes in calculations on tensors (n-dimensional arrays) and is designed to accelerate their TensorFlow [1, 2] framework, a popular building block for machine learning models. The most important component of the TPU is its Matrix Multiply unit, which is based on a systolic array. TPUs use a MIMD (Multiple Instructions, Multiple Data) [51] architecture which, unlike GPUs, allows them to execute diverging branches efficiently. TPUs are attached to the server system through the PCI Express bus, which provides them with a direct connection to the CPU and allows for a high aggregate bandwidth of 63 GB/s (PCIe 5 x16). Multiple TPUs can be used in a data center, and the individual units can collaborate in a distributed setting. The benefit of the TPU over regular CPU/GPU setups is not only its increased processing power but also its power efficiency, which is important in large-scale applications due to the cost of energy and the lim...
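To illustrate the dataflow behind a systolic-array Matrix Multiply unit, the following is a minimal, purely pedagogical Python simulation of an output-stationary systolic array. The function name and the specific dataflow are assumptions chosen for exposition, not a description of the TPU's internals:

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    Rows of A stream into the grid from the left and columns of B from
    the top, each skewed by one cycle per row/column; the processing
    element (PE) at (i, j) multiply-accumulates the operands that meet
    there, so C[i, j] = sum_s A[i, s] * B[s, j] once the pipeline drains.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    # Enough cycles for the last skewed operand to reach the last PE.
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # operand index arriving at PE (i, j) in cycle t
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

# Sanity check against a direct matrix product.
A, B = np.random.rand(4, 3), np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The key property the simulation captures is that each operand is read from memory once and then reused as it propagates through the grid, which is what makes systolic designs so power-efficient for dense matrix multiplication.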
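The collaboration of multiple accelerators in a distributed setting commonly takes the form of synchronous data-parallel training: each worker computes gradients on its own data shard, and the gradients are averaged before the shared model is updated. A conceptual sketch of this pattern, assuming a simple linear least-squares model and an in-process average standing in for a real all-reduce; `local_gradient`, `data_parallel_sgd`, and all parameters are hypothetical names for illustration:

```python
import numpy as np

def local_gradient(w, X, y):
    """Mean-squared-error gradient for a linear model on one worker's shard."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_sgd(shards, dim, lr=0.1, steps=100):
    """Synchronous data-parallel SGD across a list of (X, y) shards.

    Each 'worker' computes a gradient on its local shard; the gradients
    are then averaged (the role an all-reduce plays on a real cluster)
    and applied to the shared model replica.
    """
    w = np.zeros(dim)
    for _ in range(steps):
        grads = [local_gradient(w, X, y) for X, y in shards]  # one gradient per worker
        w -= lr * np.mean(grads, axis=0)                      # average, then update
    return w

# Example: split a synthetic regression problem across four workers.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(400, 5)), np.arange(5.0)
y = X @ true_w
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
w = data_parallel_sgd(shards, dim=5)
```

Because the averaged gradient equals the gradient over the union of the shards, this scheme produces the same updates as centralized training while spreading the computation across workers; the communication cost of the averaging step is what the techniques surveyed in this article aim to manage.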