2020
DOI: 10.1007/978-3-030-50743-5_7

Shared-Memory Parallel Probabilistic Graphical Modeling Optimization: Comparison of Threads, OpenMP, and Data-Parallel Primitives

Abstract: This work examines the performance characteristics of multiple shared-memory implementations of a probabilistic graphical modeling (PGM) optimization code, which forms the basis for an advanced, state-of-the-art image segmentation method. The work is motivated by the need to accelerate scientific image analysis pipelines in use by experimental science, such as at X-ray light sources, and by the need for platform-portable codes that perform well across many different computational architectures. The pri…
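As context for the comparison the abstract describes, below is a minimal sketch, not the authors' code, of how the same per-node update loop of a PGM optimizer might be written once with explicit std::thread workers and once with OpenMP; the `update_node` kernel and the flat `beliefs` array are hypothetical stand-ins.

```cpp
#include <omp.h>
#include <thread>
#include <vector>
#include <cstddef>

// Hypothetical per-node update of a PGM optimizer (placeholder kernel).
inline void update_node(std::vector<float>& beliefs, std::size_t i) {
    beliefs[i] *= 0.5f;  // stand-in for a real message/belief update
}

// Variant 1: explicit std::thread workers over a static block partition.
void update_threads(std::vector<float>& beliefs, unsigned nthreads) {
    std::vector<std::thread> pool;
    const std::size_t n = beliefs.size();
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t lo = t * n / nthreads;
            const std::size_t hi = (t + 1) * n / nthreads;
            for (std::size_t i = lo; i < hi; ++i) update_node(beliefs, i);
        });
    }
    for (auto& th : pool) th.join();
}

// Variant 2: the same loop under OpenMP (compile with -fopenmp).
void update_openmp(std::vector<float>& beliefs) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < beliefs.size(); ++i)
        update_node(beliefs, i);
}
```

A data-parallel-primitives version (e.g., VTK-m) would instead express `update_node` as a worklet mapped over the array, leaving partitioning and scheduling to the runtime.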

Cited by 4 publications (3 citation statements) · References 25 publications

“…One primary difference between these previous works, except for Perciano et al., 2020 [17], and our work here is the deeper introspection provided by using detailed hardware performance counters. These additional metrics offer the ability to better understand why a given code performs better or worse in a particular set of circumstances, and also help to provide a more sound basis for performance analysis.…”
Section: B. Comparing Traditional and VTK-m Implementations
confidence: 84%
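The counter-based introspection this statement credits can be reproduced in miniature with PAPI's low-level API; the sketch below is an illustration under stated assumptions, not the cited paper's measurement harness, and the `kernel()` workload is invented.

```cpp
#include <papi.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical region of interest; stands in for a PGM update sweep.
static double kernel() {
    double s = 0.0;
    for (long i = 0; i < 10'000'000; ++i) s += 1.0 / (i + 1.0);
    return s;
}

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        std::fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    int evset = PAPI_NULL;
    if (PAPI_create_eventset(&evset) != PAPI_OK ||
        PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK ||  // retired instructions
        PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK)    // total cycles
        return EXIT_FAILURE;

    long long counts[2];
    PAPI_start(evset);
    double s = kernel();                // measured region
    PAPI_stop(evset, counts);

    std::printf("sum=%g instructions=%lld cycles=%lld CPI=%.3f\n",
                s, counts[0], counts[1],
                static_cast<double>(counts[1]) / counts[0]);
    return 0;
}
```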
“…The three cases we present all exhibit different aspects of why a method might have better or worse runtime than another. In some cases, the way an algorithm is implemented, such as VTK vs. VTK-m, can have a dramatic impact on the overall number of instructions, a fact that is corroborated by other recent studies (cf. [17]). In other cases, the buffer management needed to implement a complex, multi-stage processing pipeline may trigger more memory-movement instructions, which may be more expensive and result in higher CPI values; we see evidence of this in two of the examples.…”
Section: F. Discussion of Results
confidence: 99%
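For reference, the CPI metric invoked above is simply the ratio of elapsed core cycles to retired instructions:

```latex
\mathrm{CPI} = \frac{\text{total core cycles}}{\text{retired instructions}}
```

Higher CPI at a similar instruction count usually indicates stalls, for example from the extra memory movement the authors describe.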
“…The emphasis is not merely on parallelization but also on meticulous optimizations. These refinements strategically curtail layer synchronization overheads and mitigate the intricacies tied to race conditions, ensuring the algorithm's robustness [3]. Concurrently, a discerning evaluation measures the algorithm's performance enhancements, specifically gauging the speedup in relation to the count of engaged threads.…”
Section: Introduction
confidence: 99%
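To make the two concerns in this passage concrete, eliminating a data race and gauging speedup against thread count, here is a minimal OpenMP sketch, assuming a toy reduction workload rather than the cited layered algorithm:

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

// Hypothetical workload: sum of a transformed array.
double work(const std::vector<double>& x) {
    double sum = 0.0;
    // reduction(+:sum) gives each thread a private partial sum,
    // removing the race a naive shared accumulator would have.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(x.size()); ++i)
        sum += x[i] * x[i];
    return sum;
}

int main() {
    std::vector<double> x(1 << 24, 1.5);

    double t1 = 0.0;  // single-thread baseline time
    for (int threads : {1, 2, 4, 8}) {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();
        volatile double s = work(x);  // volatile keeps the call from being elided
        double dt = omp_get_wtime() - t0;
        if (threads == 1) t1 = dt;
        std::printf("threads=%d time=%.3fs speedup=%.2f (sum=%g)\n",
                    threads, dt, t1 / dt, static_cast<double>(s));
    }
    return 0;
}
```

Speedup is computed as S(p) = T(1)/T(p); markedly sublinear values as p grows are the usual fingerprint of the synchronization overheads the passage mentions.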