Parallelizing Simulated Annealing-Based Placement Using GPGPU

Choong, Alexander; Beidas, Rami; Zhu, Jianwen

doi:10.1109/fpl.2010.17

Cited by 28 publications

(25 citation statements)

References 10 publications

“…• The first group of algorithms allows all processors to work within the same region (often the entire grid), but restricts swaps that are being evaluated in parallel, such as being from independent sets [6], [31], [32], [33]. Some algorithms employ a speculative move proposal to further accelerate the algorithm [6], and a dependency checker, executed serially, is used to ensure that no hard conflict has occurred and that soft conflicts are resolved with recalculation.…”

Section: Previous Work (mentioning)

confidence: 99%

Scalable and deterministic timing-driven parallel placement for FPGAs

Wang

Lemieux

2011

Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays

This thesis describes a parallel implementation of the timing-driven VPR 5.0 simulated-annealing placement engine. By partitioning the grid into regions and allowing distant data to grow stale, it is possible to consider a large number of nonconflicting moves in parallel and achieve a deterministic result. The full timing-driven placement algorithm is parallelized, including swap evaluation, bounding-box calculation and the detailed timing-analysis updates. The partitioned region approach slightly degrades the placement quality, but this is necessary to expose greater parallelism. We also suggest a method to recover the lost quality. In simulated annealing, runtime can be shortened at the expense of quality. Using this method, the serial placer can achieve a maximum speedup of 100X while quality metrics degrade by as much as 100%. In contrast, the parallel placer can scale beyond 500X with all quality metrics degrading by less than 30%. Specifically, at the point where the parallel placer begins to dominate over the serial placer, the post-routing minimum channel width, wirelength and critical-path delay degrade by 13%, 10% and 7% respectively on average compared to VPR's original algorithm, while achieving a 140X to 200X speedup with 25 threads. Finally, it is shown that the amount of degradation in the parallel placer is independent of the number of threads used.

“…While the algorithmic alterations made in [2] improved run-time by approximately 46% compared to [6], the maximum speed-up failed to scale for more than 16 threads. The main reason existing parallel placement annealers fail to scale is that the amount of sequential work in the approaches, including synchronization and communication, scales along with the problem size [1], [6], [2], causing the sequential runtime to quickly dominate the overall algorithm run-time as the number of threads increases due to Amdahl's Law. The second key issue with existing parallel annealing techniques is a significant reduction in quality-of-result as the amount of parallelism increases.…”

Section: Introduction (mentioning)

confidence: 98%
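The Amdahl's Law argument in the statement above can be made concrete with a short calculation. This is only an illustrative sketch: the function name `amdahl_speedup` and the 5% serial fraction are assumptions chosen for the example, not figures reported by any of the cited papers.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Ideal speed-up predicted by Amdahl's Law.

    serial_fraction: share of the single-thread runtime that cannot be
                     parallelized (hypothetical value, between 0 and 1).
    threads:         number of parallel workers (at least 1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Assumed 5% serial work: the speed-up flattens out as threads grow
    # and can never exceed 1 / 0.05 = 20x.
    for n in (1, 4, 16, 64, 256):
        print(f"{n:>4} threads -> {amdahl_speedup(0.05, n):.2f}x")
```

Under this assumed serial fraction the curve saturates well below the thread count, which is the kind of scaling limit the statement attributes to sequential synchronization and communication work.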

“…The most successful parallel annealing approaches to-date suffer from two main issues. The first issue is that of run-time scalability, where the run-time speedups of existing parallel annealers [1], [6], [2] do not scale beyond a small number of threads or processing cores. For example, in [6], the maximum speed-up obtained over a single thread was 21x, with a maximum speed-up of 17.5x over VPR [3]'s annealer.…”

Section: Introduction (mentioning)

confidence: 99%

A scalable, serially-equivalent, high-quality parallel placement methodology suitable for modern multicore and GPU architectures

Fobel

Gréwal

Stacey

2014

2014 24th International Conference on Field Programmable Logic and Applications (FPL)

Placement and routing run-times continue to dominate the automated FPGA design flow. As the size of FPGA architectures continues to grow exponentially, it remains critical to develop parallel tools for FPGA design where the amount of exposed concurrent work scales with the size of the designs to be synthesized. In this paper, we propose a novel algorithm for parallel placement, based on simulated annealing, where the amount of parallel work directly scales with the size of the net-list to be placed. Our approach concurrently evaluates and conditionally applies very large sets of non-conflicting swaps using common parallel computing primitives, including stream compaction, category reduction, and sort. While our design is suitable for targeting all modern parallel computing platforms, we present results from our implementation which targets NVIDIA's CUDA platform, where we achieve a mean speed-up of 19x over VPR with post-routing critical-path-delay and wire-length quality that matches or exceeds VPR. We believe that this work is an important step towards the development of a scalable, high-quality placement tool.
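To illustrate what a set of non-conflicting swaps means, here is a minimal serial sketch. It is not the paper's method, which relies on parallel primitives such as stream compaction and sort; the function name `select_nonconflicting` and the representation of a swap as a pair of integer slot indices are assumptions made for this example. It only filters out swaps that touch the same slot and does not account for interactions through shared nets.

```python
from typing import Iterable, List, Tuple

Swap = Tuple[int, int]  # hypothetical encoding: (slot_a, slot_b)


def select_nonconflicting(swaps: Iterable[Swap]) -> List[Swap]:
    """Greedily keep swaps so that no slot appears in more than one swap.

    Swaps are considered in the order given; a swap is dropped if either
    of its slots is already claimed by a kept swap, or if it is a no-op.
    """
    used = set()
    kept = []
    for a, b in swaps:
        if a == b or a in used or b in used:
            continue  # no-op swap, or slot already claimed
        used.update((a, b))
        kept.append((a, b))
    return kept


if __name__ == "__main__":
    proposed = [(0, 1), (1, 2), (3, 4), (4, 0), (5, 5)]
    print(select_nonconflicting(proposed))  # [(0, 1), (3, 4)]
```

Because the kept swaps touch disjoint slots, they can be applied in any order, or all at once, without one overwriting another's result.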

“…While parallelization techniques [2], [3], [4] are ultimately required to dramatically reduce runtime and therefore bring appreciably change to user experience, it is recognized that they are often limited by Amdahl's law. This is exactly the case as in routing [5], [6], [7], which is well known to occupy a significant chunk of compilation time.…”

Section: Introduction (mentioning)

confidence: 99%

Timing-Driven Routing of High Fanout Nets

Chen

Zhu

Zhang

2011

2011 21st International Conference on Field Programmable Logic and Applications

Self Cite

It has been observed in the past that the PathFinder routing algorithm runtime could be hampered by high fanout nets, primarily due to the time spent on the initialization of the priority queue. However, a solution has only been reported for routability/wirelength driven routers. In this paper, we report two heuristics that address the same issue for timing-driven routers. We show that on standard MCNC benchmarks, the proposed techniques can achieve 1.53 and 1.56 time speed up against the versatile placement and router (VPR), while achieving the same quality of result.
