OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight

Liang, Jianguo; Hua, Rong; Zhu, Wenqiang; Ye, Yuxi; Fu, You; Zhang, Hao

doi:10.1016/j.parco.2022.102893

Cited by 3 publications

(4 citation statements)

References 9 publications

“…In [49], the authors introduced a novel sequence alignment technique called ESA. This algorithm is implemented on the Sunway TaihuLight architecture and is capable of performing both local and global alignment.…”

Section: Literature Review (mentioning)

confidence: 99%

PSALR: Parallel Sequence Alignment for long Sequence Read with Hash model

Nasrin, Amin, Sima et al. 2024

Preprint


Sequence alignment and genome mapping pose significant challenges, primarily focusing on speed and storage space requirements for mapped sequences. With the ever-increasing volume of DNA sequence data, it becomes imperative to develop efficient alignment methods that not only reduce storage demands but also offer rapid alignment. This study introduces the Parallel Sequence Alignment with a Hash-Based Model (PSALR) algorithm, specifically designed to enhance alignment speed and optimize storage space while maintaining utmost accuracy. In contrast to other algorithms like BLAST, PSALR efficiently indexes data using a hash table, resulting in reduced computational load and processing time. This algorithm utilizes data compression and packetization with conventional bandwidth sizes, distributing data among different nodes to reduce memory and transfer time. Upon receiving compressed data, nodes can seamlessly perform searching and mapping, eliminating the need for unpacking and decoding at the destination. As an additional innovation, PSALR not only divides sequences among processors but also breaks down large sequences into sub-sequences, forwarding them to nodes. This approach eliminates any restrictions on query length sent to nodes, and evaluation results are returned directly to the user without central node involvement. Another notable feature of PSALR is its utilization of overlapping sub-sequences within both query and reference sequences. This ensures that the search and mapping process includes all possible sub-sequences of the target sequence, rather than being limited to a subset. Performance tests indicate that the PSALR algorithm outperforms its counterparts, positioning it as a promising solution for efficient sequence alignment and genome mapping.
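The abstract above describes hash-based indexing and overlapping sub-sequences only in general terms. The following is a minimal single-process sketch of those two ideas, not the PSALR implementation: the function names (`build_kmer_index`, `split_with_overlap`, `seed_hits`) and the parameters `k`, `chunk_len`, and `overlap` are hypothetical, and compression, packetization, and distribution across nodes are left out.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every overlapping length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def split_with_overlap(query, chunk_len, overlap):
    """Cut a long query into chunks that share `overlap` characters with the next chunk."""
    step = chunk_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_len")
    chunks = []
    for start in range(0, len(query), step):
        chunks.append((start, query[start:start + chunk_len]))
        if start + chunk_len >= len(query):
            break  # the last chunk already reaches the end of the query
    return chunks


def seed_hits(index, chunk, k):
    """Return (offset in chunk, position in reference) for every exact k-mer match."""
    hits = []
    for j in range(len(chunk) - k + 1):
        for pos in index.get(chunk[j:j + k], ()):
            hits.append((j, pos))
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTGA"
    index = build_kmer_index(reference, k=4)
    for start, chunk in split_with_overlap(reference, chunk_len=6, overlap=2):
        print(start, chunk, seed_hits(index, chunk, k=4))
```

In a distributed setting each chunk would presumably be sent to a different node; the overlap is what keeps a match that straddles a chunk boundary from being lost, provided `overlap` is at least `k - 1`.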

“…To verify the effectiveness and universality of this scheme, a core group of the SW26010p multi-core processor is used as the test platform for this experiment. The computational tasks are loaded asynchronously to the slave core for execution with the help of the high-performance threading library Athread [25]. The following schemes will be tested in this experiment: Validity test of the bracketing memory allocation algorithm; implementation of serial SpMV algorithm based on main kernel; implementation of SpMV algorithm based on master-slave acceleration; implementation of x optimization algorithm based on slave architecture LRU and LRU-K and x access optimization algorithm based on slave architecture ARC.…”

Section: Experimental Environment and Experimental Scheme (mentioning)

confidence: 99%

Research on SpMV Implementation and vector x Hit Rate Optimization for SW26010p Many-Core Platform

Wei, Jing et al. 2022

Preprint


In recent years, with the development of computer hardware, the supercomputer industry has entered a stage of rapid development, and its architecture has evolved from traditional multi-core to many-core and heterogeneous many-core. Among these, the Sunway many-core platform series, with completely independent intellectual property rights, is representative of China's heterogeneous many-core supercomputing processors. As a computing kernel, SpMV (sparse matrix-vector multiplication) is of great significance in scientific and engineering computing, and its performance often has a great impact on the overall performance of applications. The article analyzes the master-slave acceleration architecture of the SW26010p many-core processor and the implementation of SpMV for sparse matrices in the CSR storage format on that platform. Because the memory of each SW26010p slave core is limited, it cannot hold the vector data needed by large-scale SpMV, which results in long memory access times and reduced performance. To solve this problem and optimize the computational performance of SpMV, this paper studies SpMV optimization strategies for the SW26010p many-core platform. First, we propose a method of assigning tasks by the number of rows in which the non-zero elements are located, to solve the load balancing problem among slave cores. Second, we propose an adaptive memory allocation algorithm for LDM to achieve the optimal use of LDM memory. Third, based on a refined division of the LDM space, we propose several algorithms to improve the hit rate of vector x: a dynamic and static double cache algorithm based on slave-core LRU and LRU-K, and a dynamic and static cache elimination algorithm based on slave-core ARC. The performance of SpMV is optimized by reducing communication time and improving the ratio of computation to memory access. Finally, several representative sparse matrices are selected from the matrix set (Market) and tested, and the performance of the algorithms is analyzed. The results show that, compared with the traditional method, the overall x hit ratio of our scheme is greatly improved, and the master-slave speedup is also greatly improved: the maximum speedup exceeds 20 times and the average speedup reaches 10.5 times. The optimization methods adopted in this paper can also serve as a reference for other complex applications on SW26010p.
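To make the ingredients of that abstract concrete, here is a small serial sketch in Python, assuming the standard CSR layout (`row_ptr`, `col_idx`, `values`). It shows a plain CSR SpMV, one simple way to split rows so that each worker gets a similar number of non-zeros, and a hit-rate count for accesses to x under an ordinary LRU cache. The function names are hypothetical, and the cache is a plain element-granularity LRU, not the paper's dynamic and static double cache, LRU-K, or ARC schemes; nothing here uses Athread or the LDM.

```python
from collections import OrderedDict


def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def partition_rows_by_nnz(row_ptr, num_workers):
    """Split rows into contiguous blocks holding roughly equal numbers of non-zeros."""
    n_rows = len(row_ptr) - 1
    total = row_ptr[-1]
    bounds = [0]
    for w in range(1, num_workers):
        target = total * w / num_workers
        row = bounds[-1]
        # Advance until the non-zeros before `row` reach this worker's share.
        while row < n_rows and row_ptr[row] < target:
            row += 1
        bounds.append(row)
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(num_workers)]


def lru_hit_rate(col_idx, capacity):
    """Fraction of x accesses served by an LRU cache holding `capacity` elements."""
    cache = OrderedDict()
    hits = 0
    for c in col_idx:
        if c in cache:
            hits += 1
            cache.move_to_end(c)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[c] = True
    return hits / len(col_idx) if col_idx else 0.0


if __name__ == "__main__":
    # A = [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
    row_ptr = [0, 2, 3, 6]
    col_idx = [0, 2, 1, 0, 1, 2]
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))
    print(partition_rows_by_nnz(row_ptr, 2))
    print(lru_hit_rate(col_idx, 2), lru_hit_rate(col_idx, 3))
```

The sequence of `col_idx` values is exactly the sequence of accesses to x, which is why replaying it through a cache model gives a hit rate; a larger cache capacity can only help when column indices repeat within the window the cache covers.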


“…9,10,18 Our work was mainly carried out on this basis, and we completed the parallel acceleration of OpenACC based bulk silicon MD simulation program on Sunway TaihuLight, 19 and then further improved its performance by using OpenACC+Athread. 20 Ami Marowka points out that the 3P challenges of high-performance programming-performance, portability, and productivity-have become more difficult than ever in the era of heterogeneous computing. 21 Directives strive to offer portability without losing performance and are one of the most portable and productive programming models.…”

Section: Related Surveys and Our Contributions (mentioning)

confidence: 99%

“…To give full play to the advantages of multi‐scale parallel computing mode, Hou et al independently developed an efficient and highly scalable MD simulation program for crystalline silicon, and carried out large‐scale heterogeneous parallel computing tests on Mole8.5 and Tianhe‐1A 9,10,18 . Our work was mainly carried out on this basis, and we completed the parallel acceleration of OpenACC based bulk silicon MD simulation program on Sunway TaihuLight, 19 and then further improved its performance by using OpenACC+Athread 20 …”

Section: Introduction (mentioning)

confidence: 99%

A novel acceleration method for molecular dynamics of crystal silicon on GPUs using OpenACC

Liang, Hua et al. 2022

Softw Pract Exp

Self Cite


Compared with CUDA and OpenCL, OpenACC has the advantages of simple programming, openness, and good portability for GPU acceleration. An OpenMP/OpenACC implementation for molecular dynamics of silicon crystal on GPUs is proposed. First, to make effective use of vectorization and streaming, data structure conversion and data dependence elimination are designed. Second, the parallel version on a single GPU is realized by adding OpenACC directives, with very few modifications. Third, a patch block strategy is proposed to realize the parallel version on a single machine with multiple GPUs using OpenMP+OpenACC, which greatly simplifies the construction of the shadow area and the exchange of shadow area data. Experimental results show that a speedup of 23 to 25 times is achieved on a single GPU at different scales over the serial program on an Intel(R) Xeon(R) CPU E5-2690 v4, and a speedup of 6.37 times over the single GPU is achieved on 8 GPUs in a single machine when the number of atoms reaches 2,097,152.
