2014
DOI: 10.1002/cpe.3351
Optimizing the computation of a parallel 3D finite difference algorithm for graphics processing units

Abstract: This paper explores the possibilities of using a graphics processing unit for complex 3D finite difference computation via MUSTA-FORCE and WENO algorithms. We propose a novel algorithm based on the new properties of CUDA surface memory optimized for 2D spatial locality and compare it with 3D stencil computations carried out via shared memory, which is currently considered to be the best approach. A case study was performed for the extensive generation of a time series of 3D grids of arbitrary size used in the …
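The surface-memory path the abstract refers to can be sketched with CUDA's surface API. The kernel below is a minimal illustration, not the authors' code: it reads and writes a 3D grid through cudaSurfaceObject_t handles, with the x coordinate byte-addressed as the surface intrinsics require, and a simple 7-point average standing in for the paper's MUSTA-FORCE/WENO updates.

```cuda
// Hedged sketch of a finite difference step through CUDA surface objects.
// Grid dimensions and the averaging update are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void stencil7_surface(cudaSurfaceObject_t in,
                                 cudaSurfaceObject_t out,
                                 int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    // Update interior points only; boundary values are left untouched.
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;

    float c, xm, xp, ym, yp, zm, zp;
    surf3Dread(&c,  in, x * sizeof(float), y, z);       // x is in bytes
    surf3Dread(&xm, in, (x - 1) * sizeof(float), y, z);
    surf3Dread(&xp, in, (x + 1) * sizeof(float), y, z);
    surf3Dread(&ym, in, x * sizeof(float), y - 1, z);
    surf3Dread(&yp, in, x * sizeof(float), y + 1, z);
    surf3Dread(&zm, in, x * sizeof(float), y, z - 1);
    surf3Dread(&zp, in, x * sizeof(float), y, z + 1);

    surf3Dwrite((c + xm + xp + ym + yp + zm + zp) / 7.0f,
                out, x * sizeof(float), y, z);
}
```

On the host, the in/out handles would come from cudaCreateSurfaceObject over cudaArrays allocated with cudaMalloc3DArray and the cudaArraySurfaceLoadStore flag; CUDA arrays use a layout optimized for spatial locality, which is the property the paper exploits.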

Cited by 8 publications (9 citation statements)
References 20 publications (43 reference statements)
“…[7-9,11,19,21] This happens because, although row-major order enables some coalesced memory accesses for the warps of a block, all blocks in the grid compete for space in the cache. Moreover, since each thread within a block needs its own data plus that of its six neighbors, this can lead to redundant data accesses.…”
Section: Figure (citation type: mentioning, confidence: 99%)
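The redundancy this statement describes is easy to see in a plain global-memory stencil kernel. The sketch below is our illustration (a row-major float grid and a 7-point stencil are assumptions, not code from either paper): each thread issues seven loads, six of which fetch values that neighboring threads also load for their own updates, which is exactly the duplicated traffic competing for cache space.

```cuda
#include <cuda_runtime.h>

// Row-major indexing: x varies fastest, so consecutive threads in a warp
// read consecutive addresses (coalesced) for the central element.
__device__ __forceinline__ int idx(int x, int y, int z, int nx, int ny)
{
    return (z * ny + y) * nx + x;
}

__global__ void stencil7_global(const float* in, float* out,
                                int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;

    // Seven loads per thread; six are re-loads of values that neighboring
    // threads also fetch for their own updates -- the data access
    // redundancy the citing authors refer to.
    float v = in[idx(x, y, z, nx, ny)]
            + in[idx(x - 1, y, z, nx, ny)] + in[idx(x + 1, y, z, nx, ny)]
            + in[idx(x, y - 1, z, nx, ny)] + in[idx(x, y + 1, z, nx, ny)]
            + in[idx(x, y, z - 1, nx, ny)] + in[idx(x, y, z + 1, nx, ny)];
    out[idx(x, y, z, nx, ny)] = v / 7.0f;
}
```

Note that with row-major storage only the x-offset loads stay coalesced; the y and z neighbors sit whole rows or planes apart, which is why the blocks end up competing for cache lines.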
“…This means that the chance of a cache miss increases as the grid size grows. [7-9,11,21] In CUDA, threads of the same block can communicate with each other through shared memory, which is a low-latency memory but limited in total storage. According to Equation (5), by the time thread k requests the data at position (x + 1, y, z), thread k + 1 has already requested that same data, so it should be in cache memory.…”
Section: Figure (citation type: mentioning, confidence: 99%)
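A shared-memory tile is the usual remedy this statement alludes to: the block stages its sub-grid plus a one-cell halo once, so the value at (x + 1, y, z) requested by thread k is served from the same shared array that thread k + 1 filled. A hedged sketch follows; the tile sizes TX/TY/TZ and the 7-point update are our assumptions, not Equation (5) from the citing paper.

```cuda
#include <cuda_runtime.h>

#define TX 8
#define TY 8
#define TZ 8

// Each block stages its TXxTYxTZ sub-grid plus a one-cell halo in shared
// memory, so each neighbor value is fetched from global memory once and
// then reused by adjacent threads instead of re-read through the cache.
__global__ void stencil7_shared(const float* in, float* out,
                                int nx, int ny, int nz)
{
    __shared__ float tile[TZ + 2][TY + 2][TX + 2];

    int x = blockIdx.x * TX + threadIdx.x;
    int y = blockIdx.y * TY + threadIdx.y;
    int z = blockIdx.z * TZ + threadIdx.z;
    bool inside = (x < nx && y < ny && z < nz);

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1, tz = threadIdx.z + 1;
    int base = (z * ny + y) * nx + x;

    if (inside) {
        tile[tz][ty][tx] = in[base];  // own point
        // Face halos; corner/edge halos are never read by a 7-point stencil.
        if (threadIdx.x == 0      && x > 0)      tile[tz][ty][0]      = in[base - 1];
        if (threadIdx.x == TX - 1 && x < nx - 1) tile[tz][ty][TX + 1] = in[base + 1];
        if (threadIdx.y == 0      && y > 0)      tile[tz][0][tx]      = in[base - nx];
        if (threadIdx.y == TY - 1 && y < ny - 1) tile[tz][TY + 1][tx] = in[base + nx];
        if (threadIdx.z == 0      && z > 0)      tile[0][ty][tx]      = in[base - nx * ny];
        if (threadIdx.z == TZ - 1 && z < nz - 1) tile[TZ + 1][ty][tx] = in[base + nx * ny];
    }
    __syncthreads();  // every thread reaches the barrier (no early return)

    if (!inside || x < 1 || y < 1 || z < 1 ||
        x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;

    out[base] = (tile[tz][ty][tx]
               + tile[tz][ty][tx - 1] + tile[tz][ty][tx + 1]
               + tile[tz][ty - 1][tx] + tile[tz][ty + 1][tx]
               + tile[tz - 1][ty][tx] + tile[tz + 1][ty][tx]) / 7.0f;
}
```

The halo cells at the tile faces are the only extra global loads; everything else is served from shared memory, trading the redundant cache traffic for one __syncthreads() barrier and a bounded shared-memory footprint per block.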