2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
DOI: 10.1109/fccm48280.2020.00012
Accelerating Proximal Policy Optimization on CPU-FPGA Heterogeneous Platforms

Cited by 29 publications (16 citation statements)
References 16 publications
“…Additionally, the compute unit contains buffers for the outputs and updates of the neural network layers. The architecture of [47] achieves higher efficiency than the TRPO architecture according to this measure. For the accelerators implementing full DRL training, the column IPS/LUT provides a point of comparison.…”
Section: Comparison of Policy Gradient Implementations
confidence: 82%
See 1 more Smart Citation
“…Additionally, the compute unit contains buffers for the output and updates of neural network layers. [47] higher efficiency than the TRPO architecture according to this measure. For the accelerators implementing full DRL training, the column IPS/LUT provides a point of comparison.…”
Section: ) Comparison Of Policy Gradient Implementationsmentioning
confidence: 82%
“…Another heterogeneous architecture was implemented by Meng et al. [47] for the PPO algorithm. It is composed of a host CPU, which performs the loss and advantage computations, and an FPGA, which performs the forward propagation, backward propagation, and weight update.…”
Section: Implementations of Policy Gradient Algorithms
confidence: 99%
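The CPU-side computations named in the excerpt above (loss and advantage) can be illustrated with a minimal NumPy sketch of Generalized Advantage Estimation and the standard PPO clipped surrogate loss. This is an illustrative sketch of the generic PPO quantities, not the implementation of Meng et al. [47]; the function names and defaults here are hypothetical.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation, computed backward over a finished
    # trajectory (the terminal state's value is taken as 0 here).
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]   # TD residual
        last = delta + gamma * lam * last                 # discounted sum of residuals
        adv[t] = last
    return adv

def ppo_clip_loss(ratio, adv, eps=0.2):
    # PPO clipped surrogate objective, negated so a minimizer can be used.
    # `ratio` is pi_new(a|s) / pi_old(a|s) for each sampled action.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(ratio * adv, clipped))
```

In the heterogeneous split described by the excerpt, these scalar/vector computations stay on the host, while the FPGA handles the dense forward/backward passes and weight updates of the policy and value networks.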
“…Wang et al [2019] and Zhou and Prasanna [2017] have shown that some graph algorithms are similarly well-suited to these platforms. Winterstein and Constantinides [2017] have demonstrated similar results about K-means clustering applications using a different CPU/FPGA system called the Intel Cyclone V. More recently, some machine learning applications have improved their throughput when ported from a CPU/GPU implementation to a CPU/FPGA implementation [Guo et al 2019[Guo et al , 2018Meng et al 2020].…”
Section: Further Related Workmentioning
confidence: 84%
“…Recently, instead of merely running GCNs on GPUs (CPUs), various experimental platforms have been used to accelerate GCN training and inference, for instance parallel platforms [56], (multi-)FPGA platforms [33, 71, 72], and heterogeneous platforms [73, 76]. On the other hand, the computation and storage costs of sampling methods grow rapidly as graph sizes explode, putting pressure on existing experimental platforms.…”
Section: Challenges and Future Directions
confidence: 99%