Performance Analysis of Prefetching Thread for Linked Data Structure in CMPs

et al. 2011

Int J Parallel Prog

Self Cite

Helper threaded prefetching on chip multiprocessors (CMPs) is a well-known approach to reducing memory latency and has been explored for linked data structure accesses. However, conventional helper threaded prefetching often suffers from useless prefetches and cache thrashing, which limit its effectiveness. In this paper, we first analyze the shortcomings of conventional helper threaded prefetching for linked data structures. We then propose an improved scheme, Skip Helper Threaded Prefetching, for hotspots with two-level data traversals. Our solution profiles the applications and balances delinquent loads between the main thread and the prefetching thread based on the characteristics of the operations in their hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over conventional helper threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with active threaded prefetching, which synchronizes with the main thread via a semaphore, and find that our proposal provides better performance for the targeted applications.

Section: Helper Threaded Prefetching Design

mentioning

confidence: 99%

“…As indicated in [20,21], the characteristics of operations in hotspot affect the performance of the helper thread. Hence, we define Computation/Access Latency Ratio (CALR), which is the result of cycles in computation divided by cycles in memory accesses.…”
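Read as a formula, the definition quoted above can be sketched as follows. The symbols $T_{\mathrm{comp}}$ and $T_{\mathrm{mem}}$ are names introduced here for illustration only and are not taken from the paper; they stand for the cycles a hotspot spends in computation and in memory accesses, respectively.

```latex
% CALR as defined in the quoted passage; symbol names are assumptions.
\[
  \mathrm{CALR} = \frac{T_{\mathrm{comp}}}{T_{\mathrm{mem}}}
\]
% Hypothetical numbers, for illustration only:
% T_comp = 200 cycles, T_mem = 400 cycles  =>  CALR = 200 / 400 = 0.5
```

Under this reading, a CALR below 1 means the hotspot spends more cycles on memory accesses than on computation, and a CALR above 1 means the opposite.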

Section: Terminology

mentioning

confidence: 99%

The Performance Optimization of Threaded Prefetching for Linked Data Structures

et al. 2011

Int J Parallel Prog

Self Cite

“…To tolerate memory access latency, there have been a plethora of proposals for data prefetching [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]. Data prefetching techniques improve performance by predicting future memory accesses and fetch them in cache before they are accessed.…”

mentioning

confidence: 99%

“…Helper thread based prefetching techniques [19][20][21][22][23][24][25][26][27] are promising methods to deal with irregular access patterns that are hard to predict. However, because LDS are traversed in a way that prevents individual accesses from being overlapped, conventional helper thread based prefetching techniques have some problems with LDS programs.…”

mentioning

confidence: 99%

“…With the advent of chip multiprocessors (CMP) architectures, thread-based prefetching and speculative execution techniques [9][10][11][12][13][14][15][16][17][18][19][20] utilizes a helper thread to boost the performance of main thread by prefetching data into cache. Helper thread based prefetching techniques [19][20][21][22][23][24][25][26][27] are promising methods to deal with irregular access patterns that are hard to predict.…”

mentioning

confidence: 99%


Estimating Effective Prefetch Distance in Threaded Prefetching for Linked Data Structures

et al. 2012

Int J Parallel Prog

Self Cite

Helper threaded prefetching on chip multiprocessors has been shown to reduce memory latency and improve overall system performance, and has been explored for linked data structure accesses. In our earlier work, we proposed an effective threaded prefetching technique that balances delinquent loads between the main thread and the helper thread to improve the effectiveness of prefetching. In this paper, we analyze the memory access characteristics of a specific application to estimate the effective prefetch distance range for our proposed threaded prefetching technique. The effect of hardware prefetchers on the estimation is also explored. We discuss key design issues of our proposed method and present preliminary experimental results. Our experimental evaluations indicate that a bounded range of effective prefetch distances can be determined using our method, and that the optimal prefetch distance can then be found within this estimated range with a few trial runs.

Reducing Cache Pollution of Threaded Prefetching by Controlling Prefetch Distance

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

et al. 2012

Self Cite

Threaded prefetching on chip multiprocessors (CMPs) issues memory requests for data needed later by the main computation, and may therefore increase pressure on limited shared cache space and bus bandwidth. In our earlier work, we proposed an effective threaded prefetching technique that selects a proper prefetch distance for a specific application to improve the timeliness of prefetching. In this paper, we first estimate the upper limit of the prefetch distance for a specific application under our proposed threaded prefetching technique, and then analyze the effect of increasing the prefetch distance on shared cache pollution. Our experimental evaluations indicate that a bounded range of effective prefetch distances can be determined using our method, and that shared cache pollution can be reduced by controlling the prefetch distance.