As multithreaded and reconfigurable logic architectures play an increasing role in high-performance computing (HPC), the scientific community is in need of new programming models for efficiently mapping existing applications to these parallel platforms. In this paper, we show how tightly coupled fine-grained parallelism in architectures such as GPUs and FPGAs can be exploited effectively to speed up applications described by uniform recurrence equations. We introduce the concept of rolling partial-prefix sums to dynamically track and resolve multiple dependencies without having to evaluate intermediate values. Rolling partial-prefix sums are applicable to the low-latency evaluation of dynamic programming problems expressed as uniform or affine recurrence equations. To assess our approach, we consider two common problems in computational biology: hidden Markov models for protein motif finding (HMMER) and the Smith-Waterman algorithm. We present a platform-independent, linear-time solution to HMMER, which is traditionally solved in bilinear time, and a platform-independent, sub-linear-time solution to Smith-Waterman, which is normally solved in linear time.

Keywords: Dynamic Programming, HMMER, Protein-Motif Finding, GPUs, Parallelization, Computational Biology
GENERAL ROLLING PARTIAL PREFIX-SUMS ALGORITHM

Let D be a finite domain of points. Each point corresponds to a unique sub-problem, or cell, in a dynamic programming matrix. Let F be a function from D to a "result" domain Σ (e.g., the real numbers) that corresponds to the computation of the cost of a point in D. We seek to compute the value F(d) for every point d ∈ D. Let (Σ, ∧) form a commutative semigroup, i.e., the operator ∧ is a commutative, associative binary operator on results. Suppose that F(d) is computable for any d ∈ D as follows:

    F(d) = ⋀_i f_i(d),

where the summary operator ⋀ is the natural extension of ∧ from two to any nonzero number of arguments; it maps two or more values in the result domain into a single value in the same domain. Each function f_i(d) maps a finite number of points in the domain D to one element of the result domain Σ. Here we consider only monadic recurrences, where each function can be written as

    f_i(d) = F(d'_i(d)) ⊕ h_i(d),

where ⊕ is a binary extension operator on the results F(d'), and h_i(d) ∈ Σ is a "local" function that depends only on d, such as a look-up table entry, and can be computed without knowledge of any value of F. The relation d' < d must be satisfied, according to a partial order <, in order to avoid cyclic dependencies. The minimal elements of the partial order are the "base" cases. A subset B of D is said to be "sufficient" for d if every dependency path from d back to the base cases passes through an element of B. The nature of this dependency imposes a sequential execution of the function F as dictated by the partial order. Therefore, the number of algorithmic time-steps for sequential execution grows as the size of the domain D modulo <, i.e., equal to the total number of sets in D such that any two elements from different sets follow t...
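To make the recurrence framework above concrete, the following is a minimal sketch (not the paper's implementation) of sequentially evaluating a monadic recurrence F(d) = ⋀_i (F(d'_i(d)) ⊕ h_i(d)) over a one-dimensional, totally ordered domain. Here the semigroup (Σ, ∧) is assumed to be (numbers, max) and ⊕ is assumed to be ordinary addition, as in a (max, +) scoring recurrence; the dependency functions and local costs in the example are illustrative assumptions.

```python
def evaluate(n, deps, base):
    """Sequentially compute F(0..n-1) for a monadic recurrence.

    deps: list of (d_prime, h) pairs, where d_prime(d) returns the
          dependency point d'_i(d) < d, and h(d) returns the "local"
          value h_i(d), computable without knowledge of F.
    base: value of F at minimal ("base case") points with no
          in-domain dependencies.
    """
    F = [None] * n
    # Ascending order of d is a linear extension of the partial
    # order <, so every F(d') is available when F(d) is computed.
    for d in range(n):
        # One candidate f_i(d) = F(d'_i(d)) (+) h_i(d) per dependency;
        # the summary operator /\ is max in this (max, +) instance.
        candidates = [F[dp(d)] + h(d) for dp, h in deps if 0 <= dp(d) < d]
        F[d] = max(candidates) if candidates else base(d)
    return F

# Illustrative instance: two dependencies d-1 and d-2 with constant
# local costs 1 and 2, and base value 0 at minimal points.
vals = evaluate(
    n=6,
    deps=[(lambda d: d - 1, lambda d: 1),
          (lambda d: d - 2, lambda d: 2)],
    base=lambda d: 0,
)
```

Note that the loop over d is exactly the sequential execution dictated by the partial order: each time-step consumes one point, which is the bottleneck the rolling partial-prefix sums technique is designed to break.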