Characterization and Optimization Methodology Applied to Stencil Computations

Andreolli, C.; Thierry, Philippe; Borges, L.; Skinner, Gregg; Yount, Charles

doi:10.1016/b978-0-12-802118-7.00023-6

Cited by 21 publications

(34 citation statements)

References 2 publications

“…[Flattened comparison table of related work. Header terms as extracted (column order not recoverable): RTM, distributed memory, decentralized, work-stealing, asynchronous communication, one-sided communication, PGAS, MPI RMA, WS with global load information. Rows with no marks: Barros et al [24], Andreolli et al [26], Andreolli et al [27]. Rows with one mark: Sena et al [28], Hofmeyr et al [29], Tchiboukdjian et al [30], Khaitan et al [32], Tesser et al [33], Tesser et al [34], Tesser et al [35], Padoin et al [36], Padoin et al [37]. Rows with three marks: Imam and Sarkar [31], Sharma and Kanungo [38], Zheng et al [39]. Rows with four marks: Martinez et al [40], Khaitan and Mccalley [41], Mor and Maillard [42]. Rows with six marks: Li et al [43], Kumar et al [44], Dinan et al [21]. Truncated row: Vishnu and Agarwal [49] x…]”

Section: Discussion (mentioning)

confidence: 99%

Distributed-Memory Load Balancing With Cyclic Token-Based Work-Stealing Applied to Reverse Time Migration

et al. 2019

Reverse time migration (RTM) is a prominent technique in seismic imaging. Its resulting subsurface images are used in industry to investigate, with higher confidence, the existence and condition of oil and gas reservoirs. Because of its high computational cost, RTM must make use of parallel computers. Balancing the workload distribution of an RTM is a growing challenge in distributed computing systems. Competition for shared resources and the differently sized tasks of the RTM are among the possible sources of load imbalance. Although many load balancing techniques exist, scaling up to large problems and large systems remains a challenge because synchronization overhead also scales. This paper proposes a cyclic token-based work-stealing (CTWS) algorithm for distributed memory systems applied to RTM. The novel cyclic token approach reduces the number of failed steals, avoids communication overhead, and simplifies the victim selection and the termination strategy. The proposed method is implemented as a C library using the one-sided communication feature of the message passing interface (MPI) standard. Results obtained by applying the proposed technique to balance the workload of a 3D RTM system show a 14.1% speedup and a 78.4% reduction in load imbalance compared to the conventional static distribution.

The migration of seismic data is the process that attempts to build an image of the Earth's interior from recorded field data. Migration places these data into their actual geological position in the subsurface using numerical approximations of either wave-theoretical or ray-theoretical approaches to simulate the propagation of seismic waves [1]. The wave-theoretical approach to the propagation of seismic waves employs the finite difference method (FDM) [2,3] to numerically solve the equation describing the movement of the waves [1,4]. This approach is prevalent in the geophysical community because it can handle substantial velocity variations in complex geology (e.g., pre-salt).

Reverse time migration (RTM) [5-9] implements this approach. It is one of the best-known FDM-based migration methods. RTM is computationally intensive in terms of data storage and handling, and in its use of high-complexity algorithms. Therefore, exploiting parallelism is mandatory for RTM implementations in 3D Earth models (3D RTM) [10].

Parallel architectures can be classified as shared memory, when there is a single memory address space available to all processing units (e.g., nodes or cores), or distributed memory otherwise [11]. Many scientific and industrial computational resources are distributed memory systems composed of multiprocessor nodes with shared memory systems. A hybrid parallel application works at these two levels of parallelism. It can distribute the total workload among the nodes of a distributed memory system. Each node then distributes its subset of the workload among the processing units of its shared memory system. Parallel machines can also be described as het...

“…It is important to note here that there is an instruction execution overhead that the above calculations did not take into account and therefore these theoretical peak numbers are not achievable (80% is achievable in practice [25]). For this reason, two benchmark algorithms, STREAM TRIAD for memory bandwidth [26,27] and LINPACK for floating point performance [28], are often used to measure the practical limits of a particular hardware platform.…”

Section: Establishing the Roofline (mentioning)

confidence: 98%
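The STREAM TRIAD kernel named in the statement above can be sketched as follows. This is a minimal illustration, not the official benchmark: the function names `triad`, `triad_bytes`, and `bandwidth_gbs` are hypothetical, the byte count assumes 8-byte words with two reads and one write per element, and pure-Python timing mostly measures interpreter overhead rather than real memory bandwidth.

```python
import time


def triad(b, c, scalar):
    """TRIAD kernel: a[i] = b[i] + scalar * c[i] for every element."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def triad_bytes(n, word_bytes=8):
    """Bytes moved for n elements: two reads (b, c) and one write (a)."""
    return 3 * n * word_bytes


def bandwidth_gbs(n, seconds, word_bytes=8):
    """Sustained bandwidth in GB/s implied by one timed TRIAD pass."""
    return triad_bytes(n, word_bytes) / seconds / 1e9


if __name__ == "__main__":
    n = 100_000  # hypothetical array length, far smaller than a real run
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = triad(b, c, 3.0)
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against zero
    print(f"TRIAD over {n} elements: {bandwidth_gbs(n, elapsed):.3f} GB/s")
```

A production measurement would use a compiled kernel and arrays much larger than the last-level cache; the sketch only shows the arithmetic and the byte accounting behind the reported figure.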

“…Performance models such as the roofline model by [1] help establish statistics for best case performance - to evaluate the performance of a code in terms of hardware utilization (e.g. percentage of peak floating point performance) instead of a relative speed-up.…”

Section: Introduction (mentioning)

confidence: 99%

“…Using the derived formula for the algorithmic operational intensity in terms of stencil size, we can now analyze the optimal performance for each equation with respect to a specific computer architecture. We are using the theoretical and measured hardware limitations reported by Andreolli et al [25] to demonstrate how the main algorithmic limitation shifts from being bandwidth-bound at low k to compute-bound at high k on a dual-socket Intel Xeon in Fig. 4-6 and an Intel Xeon Phi in Fig.…”

mentioning

confidence: 99%

Performance prediction of finite-difference solvers for different computer architectures

Louboutin, Lange, Herrmann, et al. 2017

Computers & Geosciences

The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization; development of a correct (verified) implementation; and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort has been invested that the performance bottlenecks of a PDE solver are fully understood, and the precise details vary between different computer architectures. One way to mitigate this issue is to establish a reliable performance model that allows a numerical analyst to make reliable predictions of how well a numerical method would perform on a given computer architecture, before embarking upon potentially long and expensive implementation and optimization phases. The availability of a reliable performance model also saves developer effort, as it informs the developer both of which optimisations are beneficial and of when the maximum expected performance has been reached and optimisation work should stop. We show how discretization of a wave equation can be theoretically studied to understand the performance limitations of the method on modern computer architectures. We focus on the roofline model, now broadly used in the high-performance computing community, which considers the achievable performance in terms of the peak memory bandwidth and peak floating point performance of a computer with respect to algorithmic choices. A first principles analysis of operational intensity for key time-stepping finite-difference algorithms is presented. With this information available at the time of algorithm design, the expected performance on target computer systems can be used as a driver for algorithm design.
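The roofline bound described in this abstract can be written in a few lines. The sketch below assumes the standard form of the model, where attainable performance is the smaller of the compute peak and the product of memory bandwidth and operational intensity. The machine numbers `PEAK_GFLOPS` and `BANDWIDTH_GBS` and the function names are hypothetical, chosen only for illustration and not taken from any of the cited papers.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound in GFLOP/s for a kernel with the given operational
    intensity (FLOPs per byte moved to or from main memory)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine: 1000 GFLOP/s compute peak, 100 GB/s memory bandwidth.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    for intensity in (0.5, 2.0, 10.0, 20.0):
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
        regime = "bandwidth-bound" if intensity < ridge else "compute-bound"
        print(f"OI = {intensity:5.1f} FLOP/byte: {bound:7.1f} GFLOP/s ({regime})")
```

Under these assumed numbers, a kernel below 10 FLOPs per byte is limited by memory traffic, which is the regime the citing statements describe for stencils of low order.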

“…In [6], Andreolli et al focused on acoustic wave propagation equations, choosing the optimization techniques from systematically tuning the algorithm. The usage of collaborative thread blocking, cache blocking, register re-use, vectorization and loop redistribution resulted in significant performance improvements.…”

Section: Related Work (mentioning)

confidence: 99%

Strategies to Improve the Performance of a Geophysics Model for Different Manycore Systems

Serpa, Cruz, Diener, et al. 2017

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)

Many software mechanisms for geophysics exploration in the oil and gas industry are based on wave propagation simulation. To perform such simulations, state-of-the-art HPC architectures are employed, generating results faster and with more accuracy at each generation. The software must evolve to support the new features of each design to keep performance scaling. Furthermore, it is important to understand the impact of each change applied to the software, in order to improve the performance as much as possible. In this paper, we propose several optimization strategies for a wave propagation model for five architectures: Intel Haswell, Intel Knights Corner, Intel Knights Landing, NVIDIA Kepler and NVIDIA Maxwell. We focus on improving the cache memory usage, vectorization, and locality in the memory hierarchy. We analyze the hardware impact of the optimizations, providing insights into how each strategy can improve the performance. The results show that NVIDIA Maxwell improves over Intel Haswell, Intel Knights Corner, Intel Knights Landing and NVIDIA Kepler performance by up to 17.9x.
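Cache blocking, one of the techniques these statements mention, restructures the loops of a stencil sweep so that each tile of the grid is finished before moving on. The sketch below is an assumption-laden illustration rather than any cited author's code: it uses a simple 2D five-point stencil instead of a 3D wave-propagation kernel, and the function names and tile size are hypothetical. In pure Python the tiling brings no speedup; the example only shows the loop structure and that the result is unchanged.

```python
def make_grid(n):
    """Deterministic n x n test grid (values are arbitrary)."""
    return [[float((i * 7 + j * 3) % 11) for j in range(n)] for i in range(n)]


def five_point(src, i, j):
    """Five-point stencil: sum of the four neighbours minus 4x the centre."""
    return (src[i - 1][j] + src[i + 1][j]
            + src[i][j - 1] + src[i][j + 1]
            - 4.0 * src[i][j])


def stencil_naive(src, n):
    """Sweep the interior points row by row."""
    dst = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            dst[i][j] = five_point(src, i, j)
    return dst


def stencil_blocked(src, n, block=4):
    """Same sweep, visiting the interior one block x block tile at a time."""
    dst = [[0.0] * n for _ in range(n)]
    for ii in range(1, n - 1, block):          # tile origin, rows
        for jj in range(1, n - 1, block):      # tile origin, columns
            for i in range(ii, min(ii + block, n - 1)):
                for j in range(jj, min(jj + block, n - 1)):
                    dst[i][j] = five_point(src, i, j)
    return dst


if __name__ == "__main__":
    grid = make_grid(10)
    same = stencil_naive(grid, 10) == stencil_blocked(grid, 10, block=4)
    print("blocked sweep matches naive sweep:", same)
```

In a compiled implementation the tile size would be tuned so that a tile and its halo fit in cache, which is where the locality benefit comes from.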
