2011
DOI: 10.1109/tpds.2010.107

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Cited by 162 publications (83 citation statements)
References 15 publications
“…This type of memory is advantageous when accessing large and contiguous regions. As long as the memory access pattern is optimized, it can effectively handle hundreds or thousands simultaneous data read or write transactions [26]. On the other hand, frequent lightweight memory accesses to random regions of the global memory lead to bottlenecks in data transfer due to the high latency, resulting in a serious decrease of the performance of the parallelized application.…”
Section: A GPU Processors (mentioning)
confidence: 99%
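The contrast drawn in this statement between contiguous and scattered global-memory accesses can be sketched with a toy transaction count. This is a simplified model, not the cited paper's method: it assumes a 32-thread warp, 4-byte elements, and that memory is served in aligned 128-byte segments. All three values are assumptions chosen for illustration, and real hardware differs by generation. The function name `transactions` is hypothetical.

```python
import random

WARP_SIZE = 32       # assumed number of threads issuing one memory request together
ELEM_BYTES = 4       # assumed element size (e.g. a 32-bit float)
SEGMENT_BYTES = 128  # assumed size of one aligned memory transaction


def transactions(indices, elem_bytes=ELEM_BYTES, segment_bytes=SEGMENT_BYTES):
    """Count the distinct aligned segments touched by a warp's element indices."""
    return len({(i * elem_bytes) // segment_bytes for i in indices})


# Thread t reads element t: every access falls in the same segment.
contiguous = list(range(WARP_SIZE))

# Thread t reads element 32 * t: every access falls in a different segment.
strided = [t * 32 for t in range(WARP_SIZE)]

# Thread t reads an arbitrary element of a large array (fixed seed for repeatability).
scattered = random.Random(0).sample(range(1 << 20), WARP_SIZE)

print("contiguous:", transactions(contiguous))  # 1 segment
print("strided:   ", transactions(strided))     # 32 segments
print("scattered: ", transactions(scattered))
```

Under these assumptions the contiguous pattern needs one segment for the whole warp, while the strided pattern needs one segment per thread, which is the effect the statement describes as a transfer bottleneck.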
“…The DL [19] work studies an Array-of-Structure-of-Tiled-Array (ASTA) layout and in-place data marshaling for improving the device memory throughput for GPU. Jang et al. [10] used a mathematical model and algorithms to analyze data access patterns and target loop vectorization and GPU memory selection with different patterns. Zhang et al. [23] proposed a library to reduce irregularities in GPU programs through a level of indirection and job swapping to improve branch and memory divergence.…”
Section: Related Work (mentioning)
confidence: 99%
“…High application performance relies heavily on efficient memory bandwidth utilizations. Though GPUs usually have a wider memory interface than CPUs, performance would be suboptimal in the presence of insufficient memory coalescing [10], [20], [23].…”
Section: Introduction (mentioning)
confidence: 99%
“…The CUDA grid and block indexing is column-major ordered, with the 'x' direction along the columns and the 'y' direction is along the rows, and the threads are scheduled in that order. Thus, when copying arrays from global memory to shared memory, it is important that the access pattern of the memory being copied matches the access pattern of the thread scheduler [34]. Since C arrays are row-major, a shared memory array A should be allocated (counter- Since in MATLAB (when using meshgrid() to form the x and y arrays of Ψ) the x direction of Ψ is along the rows of the Ψ array, and the y direction is along the columns, and when in the MEX file, this is transposed, then, as mentioned in Sec.…”
Section: Two-dimensional Specific Code Design (mentioning)
confidence: 99%
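The point in the last statement about matching array layout to thread order can be illustrated with a small index calculation. The sketch below relies on one fact from the CUDA programming model: within a 2D block of size (Dx, Dy), the thread at (x, y) has linear ID x + y * Dx, and warps are formed from consecutive linear IDs. Everything else is an assumption for illustration: the 32-thread warp, the 32 x 8 block, the 64-column array, and the helper names `first_warp`, `row_major_offset`, and `offsets`.

```python
WARP_SIZE = 32  # assumed warp size


def first_warp(block_x, block_y):
    """(x, y) coordinates of the first warp's threads; x varies fastest."""
    assert block_x * block_y >= WARP_SIZE, "block must hold at least one warp"
    return [(lin % block_x, lin // block_x) for lin in range(WARP_SIZE)]


def row_major_offset(row, col, ncols):
    """Element offset of A[row][col] in a C-style (row-major) array."""
    return row * ncols + col


def offsets(block_x, block_y, ncols, x_is_column):
    """Offsets touched by the first warp for a given thread-to-index mapping."""
    result = []
    for x, y in first_warp(block_x, block_y):
        # x_is_column=True  corresponds to A[y][x]; False corresponds to A[x][y].
        row, col = (y, x) if x_is_column else (x, y)
        result.append(row_major_offset(row, col, ncols))
    return result


good = offsets(32, 8, 64, x_is_column=True)
bad = offsets(32, 8, 64, x_is_column=False)
print(good[:4])  # [0, 1, 2, 3]      consecutive elements
print(bad[:4])   # [0, 64, 128, 192] one full row apart
```

With the x thread index mapped to the column (fastest-varying) index, neighbouring threads touch neighbouring elements; with the mapping swapped, each thread lands a full row away from its neighbour.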