FPGA Implementation of the Conjugate Gradient Method

Maslennikow, O.; Lepekha, Volodymyr; Sergyienko, Anatoli

doi:10.1007/11752578_63

Cited by 11 publications

(3 citation statements)

References 6 publications


“…Several FPGA implementations of CG have been proposed in the past. Given that the convergence of conjugate gradient method is sensitive to precision used, in [6], a fractional number system is proposed. In [7], a high performance architecture for CG, in which data blocking to partition large sparse matrices into square blocks, was proposed.…”

Section: Introduction and Previous Work (mentioning)

confidence: 99%

Efficient FPGA Implementation of Conjugate Gradient Methods for Laplacian System using HLS

Rampalli

Sehgal

Bindlish

et al. 2019

Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


In this paper, we study FPGA-based pipelined and superscalar designs of two variants of the conjugate gradient method for solving the Laplace equation on a discrete grid; the first version corresponds to the original conjugate gradient algorithm, and the second to a slightly modified version of it. When the conjugate gradient method is used to solve partial differential equations, matrix-vector operations are required in each iteration; these operations can be implemented as 5-point stencil operations on the grid without explicitly constructing the matrix. We show that a pipelined and superscalar design using high-level synthesis written in C leads to a significant reduction in latency for both methods. Comparing the two, we show that the latter has roughly two times lower latency than the former given the same degree of superscalarity. This reduction in latency for the newer variant of CG is due to parallel implementation of the stencil operation on subdomains of the grid, and to the overlap of these stencil operations with dot product operations. In a superscalar design, the domain needs to be partitioned and boundary data needs to be copied, which requires padding. In a 1D partition, the padding latency increases as the number of partitions increases. For a streaming data flow model, we propose a novel traversal of the grid for 2D domain decomposition that leads to a 2-times reduction in the latency cost of padding compared to 1D partitions. Our implementation is roughly 10 times faster than a software implementation for a linear system of dimension 10000 × 10000.
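The matrix-free idea in this abstract can be illustrated with a minimal software sketch. It is not the authors' HLS design: the function names (`apply_stencil`, `cg_stencil`), the zero boundary values, and the stopping tolerance are all assumptions chosen for illustration. The sketch runs the standard conjugate gradient iteration while applying the 5-point stencil directly to the grid, so the matrix is never formed.

```python
def apply_stencil(u, n):
    """Apply the 5-point Laplacian stencil (4 at the centre, -1 per neighbour)
    to an n x n grid stored row-major. Values outside the grid are taken as 0
    (an assumption made for this sketch)."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            k = i * n + j
            s = 4.0 * u[k]
            if i > 0:
                s -= u[k - n]
            if i < n - 1:
                s -= u[k + n]
            if j > 0:
                s -= u[k - 1]
            if j < n - 1:
                s -= u[k + 1]
            out[k] = s
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def cg_stencil(b, n, tol=1e-10, max_iter=1000):
    """Standard conjugate gradient, with the matrix-vector product replaced
    by apply_stencil. Returns the solution and the iteration count."""
    x = [0.0] * (n * n)
    r = list(b)          # residual b - A x, with x = 0
    p = list(r)          # initial search direction
    rs = dot(r, r)
    for it in range(max_iter):
        if rs ** 0.5 < tol:
            return x, it
        ap = apply_stencil(p, n)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iter
```

Each iteration needs one stencil sweep and two dot products; these are the operations the abstract describes overlapping in hardware.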


“…These implementations include a Cholesky approach that achieved a performance increase of 50% over software on a APEX EP20K15000E FPGA [1]; a Jacobi solver implementation on a Xilinx VirtexII Pro XC2VP50 which achieved a speedup of 1.3 to 36.8 relative to a highend processor, depending on the matrix structure [2]; and two CG implementations. One of these implementations used the Logarithmic Number System (LNS) and reached up to 0.94 GFLOPS on a VirtexII-6000 [3], while the other used a rational number system representation and achieved 0.27 GFLOPS on a VirtexII Pro XC2VP4 [4]. Table 1 summarizes FPGA implementations of Conjugate Gradient method in terms of year of publication, number system, input problem structure, device and GFLOPS achieved.…”

Section: Introduction (mentioning)

confidence: 99%

A floating-point solver for band structured linear equations

Lopes

Constantinides

Kerrigan

2008

2008 International Conference on Field-Programmable Technology


Field Programmable Gate Arrays (FPGAs) have gradually been increasing their capacities and have started to incorporate optimized coarse-grained modules such as BlockRAMs, multipliers, and even processors. These developments have extended their field of applications, and one field that has been gaining significant interest is the acceleration of floating-point scientific computing. In this field, a recurring subtask is the solution of systems of linear equations. One well-studied method that has proven to be very efficient in software and robust at finding such solutions is the Conjugate Gradient (CG) algorithm. In this paper we present a hardware CG method which takes advantage of the banded structure present in many common problems. With the flexibility provided by FPGAs, this implementation employs wide parallelization to reduce the per-iteration computation time for an order n matrix with band width w from Θ(nw) clock cycles in a software implementation to Θ(n) in hardware. It also exploits deep pipelining so that solutions to P problems are produced every Θ(n) cycles, as opposed to every Θ(Pnw) cycles in software. Results demonstrate that performance of up to 32 GFLOPS is achievable on a Virtex5-330T FPGA, and a software comparison reports significant speed-ups relative to high-end CPUs.
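The Θ(nw) software cost quoted above can be made concrete with a small sketch of a banded matrix-vector product, the dominant operation in each CG iteration. The storage layout (`band[i][d]` holding the entry in row `i`, column `i - h + d`, for a hypothetical half-bandwidth `h` with w = 2h + 1) and the function name `banded_matvec` are assumptions for illustration and are not taken from the paper.

```python
def banded_matvec(band, x, h):
    """Multiply a banded matrix by x, counting multiply-accumulates.

    band[i][d] is assumed to hold A[i][i - h + d] for d in 0..2h, so the
    band width is w = 2h + 1. Entries that fall outside the matrix are
    skipped and not counted.
    """
    n = len(x)
    y = [0.0] * n
    macs = 0
    for i in range(n):                 # n rows ...
        acc = 0.0
        for d in range(2 * h + 1):     # ... times at most w products each
            j = i - h + d
            if 0 <= j < n:
                acc += band[i][d] * x[j]
                macs += 1
        y[i] = acc
    return y, macs


# Usage: a tridiagonal (h = 1, w = 3) matrix of order 5.
band = [[-1.0, 2.0, -1.0] for _ in range(5)]
y, macs = banded_matvec(band, [1.0, 2.0, 3.0, 4.0, 5.0], 1)
```

A sequential loop like this performs about n·w multiply-accumulates. One plausible reading of the abstract's Θ(n) hardware figure is that the w products of the inner loop are evaluated concurrently, leaving only the loop over rows; the details of the actual architecture are in the paper itself.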


“…One uses a Logarithmic Number System (LNS) and achieves up to 1.1 GFLOPS on a VirtexII-6000 [30]. The other uses a rational number representation and achieves 0.27 GFLOPS using a VirtexII Pro XC2VP4 [31] and projects that it will be able to sustain 15 GFLOPS on a Virtex4-55. In contrast, we present a widely-parallelised and deeply-pipelined Conjugate Gradient method using the IEEE 754 [32] single precision floating point number representation.…”

Section: Previous FPGA Implementations (mentioning)

confidence: 99%

A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation

Lopes

Constantinides

Lecture Notes in Computer Science


Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation, and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this paper we present a widely-parallel and deeply-pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited to accelerating multiple small-to-medium sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. It is shown that through parallelization it is possible to reduce the computation time per iteration for an order n matrix from Θ(n²) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and to converge to a constant value as the matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFLOPS, and results on a Virtex5-330 indicate sustained performance of 35 GFLOPS. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation represents a significant speed-up of at least an order of magnitude.
