Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
DOI: 10.1145/3373376.3378465
AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming

Abstract: Memory capacity is a key bottleneck for training large-scale neural networks. Intel® Optane™ DC PMM (persistent memory modules), which are available as NVDIMMs, are a disruptive technology that promises significantly higher read bandwidth than traditional SSDs at a lower cost per bit than traditional DRAM. In this work we show how to take advantage of this new memory technology to minimize the amount of DRAM required without significantly compromising performance. Specifically, we take advantage of the static na…
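The paper's title names the core technique: casting tensor movement between DRAM and PMM as an integer linear program. As a rough illustration of that idea only (not AutoTM's actual formulation, which schedules tensor movement across the computation graph), the sketch below uses PuLP to choose which tensors to keep in DRAM under a capacity budget; the tensor names, sizes, and slowdown costs are made up.

```python
# A minimal ILP sketch of DRAM-vs-PMM tensor placement, assuming PuLP is
# installed. Each tensor gets a binary variable: 1 = keep in DRAM, 0 = PMM.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# (name, size in MiB, estimated slowdown in ms if the tensor stays in PMM)
tensors = [("conv1_act", 256, 4.0), ("conv2_act", 512, 9.0),
           ("fc1_weights", 128, 1.5), ("grad_buffer", 512, 7.0)]
DRAM_BUDGET_MIB = 768

prob = LpProblem("tensor_placement", LpMinimize)
in_dram = {name: LpVariable(f"dram_{name}", cat=LpBinary)
           for name, _, _ in tensors}

# Objective: total slowdown incurred by tensors left in PMM.
prob += lpSum(cost * (1 - in_dram[name]) for name, _, cost in tensors)
# Constraint: tensors placed in DRAM must fit within the DRAM budget.
prob += lpSum(size * in_dram[name] for name, size, _ in tensors) <= DRAM_BUDGET_MIB

prob.solve()
for name, size, _ in tensors:
    tier = "DRAM" if in_dram[name].value() == 1 else "PMM"
    print(f"{name:12s} ({size:4d} MiB) -> {tier}")
```

The real formulation additionally models when each tensor is live and the cost of moving it, so placement can change over the course of a training iteration rather than being fixed once.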

Cited by 50 publications (22 citation statements); references 32 publications.
“…A straightforward method to reduce the DRAM footprint is to asynchronously flush cached regions into NVM and reclaim them in advance. However, prior work [20] has reported negative results for this method. Although asynchronous flushing can reduce DRAM consumption, it introduces NVM write operations, which may reduce available NVM bandwidth and worsen GC performance.…”
Section: Asynchronous Region Flushing
confidence: 88%
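To make the quoted trade-off concrete, here is a minimal sketch (not the cited system's implementation) of asynchronous region flushing: a background thread writes dirty DRAM-cached regions to an NVM-backed file and drops the DRAM copy. Every flush is an extra NVM write, which is exactly the traffic the quote says competes with foreground NVM bandwidth. The path and region size are illustrative, and the NVM mount point is assumed.

```python
# Sketch of asynchronous region flushing; nvm_path is an assumed DAX mount.
import threading, queue

nvm_path = "/mnt/pmem0/regions.bin"
dram_cache = {}                # region_id -> bytes held in DRAM
flush_queue = queue.Queue()

def flush_worker():
    with open(nvm_path, "wb") as nvm:
        while True:
            region_id = flush_queue.get()
            if region_id is None:
                break
            data = dram_cache.pop(region_id, None)  # reclaim the DRAM copy
            if data is not None:
                nvm.write(data)   # this is the added NVM write traffic
                nvm.flush()

threading.Thread(target=flush_worker, daemon=True).start()

dram_cache["r1"] = b"\x00" * (4 << 20)   # 4 MiB region cached in DRAM
flush_queue.put("r1")                     # schedule its flush to NVM
```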
“…When the number of write operations increases, the overall bandwidth declines sharply. This problem is possibly caused by NVM's asymmetric bandwidth: its peak read bandwidth is much higher than its peak write bandwidth [20,24]. Other NVM technologies, such as phase-change memory (PCM), have similar problems [34].…”
Section: Detailed Bandwidth Analysis
confidence: 99%
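A back-of-the-envelope model shows why a growing write fraction drags down overall bandwidth when the read and write peaks are asymmetric: the mixed stream is dominated by the much slower write path. The peak numbers below are assumptions for illustration, not measurements of Optane DC PMM.

```python
# Harmonic-mean model of effective bandwidth for an interleaved read/write
# stream with asymmetric peak bandwidths (illustrative numbers).
READ_BW_GBPS = 6.0    # assumed peak read bandwidth
WRITE_BW_GBPS = 2.0   # assumed peak write bandwidth (much lower)

def effective_bandwidth(write_fraction: float) -> float:
    """Effective bandwidth of a stream with the given write fraction."""
    read_fraction = 1.0 - write_fraction
    return 1.0 / (read_fraction / READ_BW_GBPS + write_fraction / WRITE_BW_GBPS)

for wf in (0.0, 0.1, 0.3, 0.5):
    print(f"write fraction {wf:.1f}: ~{effective_bandwidth(wf):.2f} GB/s")
```

Even a 30% write fraction roughly halves effective bandwidth under these assumed peaks, which matches the qualitative trend the quote describes.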
“…The input sizes of each operation are obtained by offline analysis. For many DNN training workloads, once their hyperparameters (e.g., batch size) are determined, the input sizes for each operation can be known before training takes place [24,32,33,54].…”
Section: Offline FPGA Kernel Optimization
confidence: 99%
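The kind of offline analysis the quote refers to is straightforward once the hyperparameters are fixed: every operation's input, output, and weight byte counts follow directly from the layer shapes and the batch size. The two-layer MLP below is purely illustrative.

```python
# Compute per-operation tensor sizes from fixed hyperparameters (a sketch).
BATCH = 64
DTYPE_BYTES = 4  # float32
layers = [("fc1", 784, 1024), ("fc2", 1024, 10)]  # (name, in_dim, out_dim)

for name, in_dim, out_dim in layers:
    input_bytes = BATCH * in_dim * DTYPE_BYTES
    output_bytes = BATCH * out_dim * DTYPE_BYTES
    weight_bytes = in_dim * out_dim * DTYPE_BYTES
    print(f"{name}: in={input_bytes / 2**20:.2f} MiB, "
          f"out={output_bytes / 2**20:.2f} MiB, "
          f"weights={weight_bytes / 2**20:.2f} MiB")
```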
“…We target common DNN models whose dataflow graphs do not exhibit data-dependent control flow, so each training step goes through exactly the same graph, which implies that the input sizes of operations can be known before training. Such DNN models are very common and have been the targets of recent works [24,32,33,54,73].…”
Section: Introduction
confidence: 99%
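The static-graph property is what makes the offline analysis above possible: with no data-dependent control flow, a single symbolic pass over the graph determines every intermediate tensor's shape, and the same schedule repeats on every training step. The toy graph and shape rules below are made up for illustration.

```python
# One topological pass over a static dataflow graph to precompute shapes
# and byte footprints before any data is seen (illustrative graph).
graph = {  # op -> (input ops, shape rule as a function of input shapes)
    "input": ([], lambda: (64, 3, 224, 224)),
    "conv1": (["input"], lambda s: (s[0], 64, s[2] // 2, s[3] // 2)),
    "pool1": (["conv1"], lambda s: (s[0], s[1], s[2] // 2, s[3] // 2)),
}

shapes = {}
for op, (inputs, shape_rule) in graph.items():  # insertion order is topological
    shapes[op] = shape_rule(*(shapes[i] for i in inputs))
    elems = 1
    for d in shapes[op]:
        elems *= d
    print(f"{op}: shape={shapes[op]}, bytes={elems * 4}")  # float32
```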
“…As a result, PM and DRAM form a heterogeneous memory (HM) system. How to place and migrate data between PM and DRAM to exploit both the speed of DRAM and the capacity of PM remains an active research question [7,11,22,26,39,40].…”
Section: Introduction
confidence: 99%
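One generic family of answers to the placement question is hotness-based migration. The sketch below is not any cited system's policy, only an illustration: pages that accumulate enough accesses are promoted to DRAM, and the coldest DRAM-resident page is demoted to PM when the DRAM budget is full. Capacities and thresholds are made up.

```python
# A toy access-count-based promotion/demotion policy for DRAM/PM placement.
from collections import defaultdict

DRAM_CAPACITY_PAGES = 2
PROMOTE_THRESHOLD = 3

access_counts = defaultdict(int)
in_dram = set()

def touch(page: int) -> None:
    """Record an access; promote the page to DRAM once it becomes hot."""
    access_counts[page] += 1
    if page in in_dram or access_counts[page] < PROMOTE_THRESHOLD:
        return
    if len(in_dram) >= DRAM_CAPACITY_PAGES:
        coldest = min(in_dram, key=lambda p: access_counts[p])
        in_dram.remove(coldest)   # demote the coldest page back to PM
    in_dram.add(page)             # promote the hot page into DRAM

for p in [1, 1, 1, 2, 3, 3, 3, 3, 1]:
    touch(p)
print("DRAM-resident pages:", sorted(in_dram))
```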