Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures 2008
DOI: 10.1145/1378533.1378574
Cache-efficient dynamic programming algorithms for multicores

Abstract: We present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. We derive results for three classes of problems: local dependency dynamic programming (LDDP), Gaussian Elimination Paradigm (GEP), and the parenthesis problem. For each class o…

Cited by 75 publications (67 citation statements) · References 31 publications
“…Dynamic programs are usually described through recurrence relations that specify how to decompose sub-problems, and are typically implemented using a DP table where each cell holds the computed solution for one of these sub-problems. The table can be filled by visiting each cell once in some predetermined order, but recent research has shown that it is possible to achieve order-of-magnitude performance improvements over this standard implementation approach by developing divide-and-conquer implementation strategies that recursively partition the space of subproblems into smaller subspaces [4, 8–11, 32].…”
Section: Overview (mentioning, confidence: 99%)
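The contrast drawn in this excerpt can be sketched concretely. Below is a minimal Python illustration using longest common subsequence (LCS) as a representative LDDP — an illustrative choice, not the paper's actual code: the first version fills the table in a predetermined row-major order, while the second recursively partitions the index space into quadrants and fills them in dependency order, which is the access pattern underlying the cache-efficient divide-and-conquer strategies discussed here.

```python
def lcs_iterative(a, b):
    """Standard fill: visit every cell once in row-major order."""
    m, n = len(a), len(b)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i][j] = (T[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                       else max(T[i - 1][j], T[i][j - 1]))
    return T[m][n]

def lcs_recursive(a, b, base=4):
    """Divide-and-conquer fill of the same table over quadrants."""
    m, n = len(a), len(b)
    T = [[0] * (n + 1) for _ in range(m + 1)]

    def fill(i_lo, i_hi, j_lo, j_hi):
        # Fill cells T[i][j] for i in [i_lo, i_hi), j in [j_lo, j_hi).
        if i_hi - i_lo <= base or j_hi - j_lo <= base:
            for i in range(i_lo, i_hi):
                for j in range(j_lo, j_hi):
                    T[i][j] = (T[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                               else max(T[i - 1][j], T[i][j - 1]))
            return
        i_mid = (i_lo + i_hi) // 2
        j_mid = (j_lo + j_hi) // 2
        fill(i_lo, i_mid, j_lo, j_mid)   # top-left first
        fill(i_lo, i_mid, j_mid, j_hi)   # top-right and bottom-left depend
        fill(i_mid, i_hi, j_lo, j_mid)   # only on top-left (parallelizable)
        fill(i_mid, i_hi, j_mid, j_hi)   # bottom-right last

    fill(1, m + 1, 1, n + 1)
    return T[m][n]
```

Both versions compute the identical table; the recursive one simply reorders the cell visits so that each subproblem's working set eventually fits in cache, which is where the reported order-of-magnitude improvements come from.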
“…Later, these concepts have been applied to divide-and-conquer in a disciplined way in [7, 31]; these address divide-and-conquer in the classical sense of [13] (Chapter 4), focusing on parallelism. In Bellmania, more focus is put on re-ordering of array reads and writes, following and mechanizing techniques related to DP from [8, 9]. In fact, traditional parallelism is taken "for granted" for our aggregation operators, since they are associative and methods such as [17] apply rather trivially, and translated into the tactics Slice, Assoc, and Distrib.…”
Section: Related Work (mentioning, confidence: 99%)
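The associativity this excerpt relies on can be seen in the parenthesis-problem recurrence itself. Below is a minimal Python sketch using matrix-chain multiplication as an illustrative instance (not Bellmania's code): each cell aggregates candidate split points with `min`, and because `min` is associative, the range of split points can be sliced into chunks whose partial results are combined in any order — exactly why parallelizing the aggregation is straightforward.

```python
def matrix_chain(dims):
    """Parenthesis-problem DP: minimal cost to multiply a chain of
    matrices, where matrix i has shape dims[i] x dims[i+1]."""
    n = len(dims) - 1
    C = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            # The min over split points k is an associative aggregation:
            # disjoint chunks of k may be reduced independently and the
            # partial minima combined in any order.
            C[i][j] = min(C[i][k] + C[k + 1][j]
                          + dims[i] * dims[k + 1] * dims[j + 1]
                          for k in range(i, j))
    return C[0][n - 1]
```

For dims = [10, 30, 5, 60], parenthesizing as (AB)C costs 10·30·5 + 10·5·60 = 4500, which the DP returns as the minimum.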
“…Therefore, C(G) has only … supernodes (see Table 2; the calculations are not difficult and are omitted for brevity). Therefore, the number of parallel steps required to execute all supernodes is the sequential time complexity divided by the number of processors p. In a more recent work [10] we have shown that for both shared and distributed caches the depth of any GEP computation can be improved to O(n) while still matching its optimal sequential cache complexity, by choosing tile sizes that depend only on the number of cores/processors and thus still remaining cache-oblivious. This is the maximum parallelism achievable when staying within the GEP framework.…”
Section: Subdag in G corresponding to any supernode v is denoted by S (mentioning, confidence: 99%)
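For readers unfamiliar with the GEP framework referenced here: it covers triply nested loop computations that update c[i][j] from c[i][k] and c[k][j], and Floyd-Warshall all-pairs shortest paths is its canonical instance. A minimal Python sketch of that loop nest (the cache-efficient variants in the paper recursively tile this same computation; this sketch shows only the untiled structure):

```python
INF = float("inf")

def floyd_warshall(c):
    """In-place all-pairs shortest paths on an n x n distance matrix.
    The k loop is the GEP 'pivot' dimension; each pass relaxes every
    pair (i, j) through intermediate vertex k."""
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if c[i][k] + c[k][j] < c[i][j]:
                    c[i][j] = c[i][k] + c[k][j]
    return c
```

The GEP results quoted above apply to any computation with this dependency structure, not just shortest paths — Gaussian elimination without pivoting and matrix multiplication fit the same template.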
“…Chowdhury and Ramachandran [9] consider cache complexity in both private- and shared-cache models for matrix-based computations, including the all-pairs shortest paths algorithm of Floyd-Warshall. They also consider parallel dynamic programming algorithms in private-, shared- and multicore-cache models [10].…”
Section: A. Prior Related Work (mentioning, confidence: 99%)