Efficient methods for out-of-order load/store execution for high-performance soft processors

Wong, Henry; Betz, Vaughn; Rose, Jonathan

doi:10.1109/fpt.2013.6718409

Cited by 7 publications

(8 citation statements)

References 11 publications


“…This scheme is also impractical, as the LSQ would have as many entries as the number of static memory accesses in the application, which would be unrealistically large in all but the most trivial of cases. Others have shown that both the critical path (assuming single-cycle accesses) and resource requirement demands grow as a function of the number of LSQ entries [21]; our implementation results confirm this observation.…”

Section: Supplying a Sequential Order To The LSQ (supporting)

confidence: 86%

“…Although our design exhibits ample parallelism and performs most operations concurrently, some functionalities cannot be implemented in constant time - for instance, to bypass data from the store to the load queue, one needs to check the store queue from the head to the tail to find the last conflicting data, and this requires at best O(log n) time for an n-depth queue. This sensitivity to the number of queue entries is in line with results reported by others in conventional LSQ designs - previous efforts to implement conventional LSQs in FPGAs have exhibited the same trends of resource and clock degradation with queue size [21]. These results motivate us to consider alternative design options in the future - our group allocation policy is generally applicable and can be incorporated into different queue architectures.…”

Section: Resource Utilization and Timing Analysis (supporting)

confidence: 85%
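
The O(log n) bound in the statement above can be illustrated with a small software model. This is only a sketch under assumptions, not the cited design: the store queue is modeled as a list of hypothetical (address, data) pairs ordered from head (oldest) to tail (youngest), every entry is assumed older than the load, and the function name is made up for illustration. All address comparisons are treated as happening in parallel, and the youngest match is then selected by a pairwise reduction whose number of levels grows with log2(n).

```python
from typing import List, Optional, Tuple


def youngest_matching_store(
    stores: List[Tuple[int, int]], load_addr: int
) -> Tuple[Optional[int], int]:
    """Return (forwarded data or None, number of reduction levels).

    `stores` is ordered head (oldest) to tail (youngest); this layout and
    the function name are assumptions made for this sketch.
    """
    # Step 1: one comparison per entry (parallel comparators in hardware).
    level: List[Optional[int]] = [
        i if addr == load_addr else None for i, (addr, _) in enumerate(stores)
    ]
    # Step 2: pairwise reduction that always prefers the younger (right) match.
    depth = 0
    while len(level) > 1:
        nxt: List[Optional[int]] = []
        for k in range(0, len(level), 2):
            left = level[k]
            right = level[k + 1] if k + 1 < len(level) else None
            nxt.append(right if right is not None else left)
        level = nxt
        depth += 1
    idx = level[0] if level else None
    data = stores[idx][1] if idx is not None else None
    return data, depth
```

For a four-entry queue the reduction takes two levels and for eight entries three, which is the growth in depth that the statement ties to queue size.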


An Out-of-Order Load-Store Queue for Spatial Computing

Josipović, Brisk, Ienne

2017

ACM Trans. Embed. Comput. Syst.


The efficiency of spatial computing depends on the ability to achieve maximal parallelism. This necessitates memory interfaces that can correctly handle memory accesses that arrive in arbitrary order while still respecting data dependencies and ensuring appropriate ordering for semantic correctness. However, a typical memory interface for out-of-order processors (i.e., a load-store queue) cannot immediately meet these requirements: a different allocation policy is needed to achieve out-of-order execution in spatial systems that naturally omit the notion of sequential program order, a fundamental piece of information for correct execution. We show a novel and practical way to organize the allocation for an out-of-order load-store queue for spatial computing. The main idea is to dynamically allocate groups of memory accesses (depending on the dynamic behavior of the application), where the access order within the group is statically predetermined (for instance by a high-level synthesis tool). We detail the construction of our load-store queue and demonstrate on a few practical cases its advantages over standard accelerator-memory interfaces.
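
The group allocation idea in this abstract can be sketched as follows. Everything here is a hypothetical illustration rather than the authors' implementation: the table `STATIC_GROUP_ORDER`, the group names, and the choice of one group per basic block are assumptions. The point shown is only that the order inside a group is fixed ahead of time, while the sequence in which groups are allocated is decided at run time.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Assumed static table, e.g. produced by an HLS tool: the order of the
# memory accesses inside each group (group names are hypothetical).
STATIC_GROUP_ORDER: Dict[str, List[str]] = {
    "bb_loop": ["load", "load", "store"],
    "bb_exit": ["store"],
}


def allocate_group(queue: Deque[Tuple[str, str]], group: str) -> None:
    """Append one queue entry per access of `group`, in its static order."""
    for kind in STATIC_GROUP_ORDER[group]:
        queue.append((group, kind))


# The dynamic part: which group is allocated next depends on control flow.
lsq: Deque[Tuple[str, str]] = deque()
for taken_group in ["bb_loop", "bb_loop", "bb_exit"]:
    allocate_group(lsq, taken_group)
```

After these three allocations the queue holds the entries of two loop iterations followed by the exit store, which gives the queue the sequential order it needs without a program counter.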


“…To avoid pipeline stalls due to unpredictable memory accesses, a circuit can use additional logic to handle memory accesses at runtime [20]. If proven safe to do so, the logic should allow loads from later loop iterations to be executed without waiting for stores from earlier iterations to commit.…”

Section: Runtime Memory Disambiguation In HLS (mentioning)

confidence: 99%
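
The safety condition described in the statement above can be written as a small check. This is a minimal sketch of one conservative policy, assumed here for illustration and not taken from the cited papers: a load may run ahead only when every older store already has a known address and none of those addresses equals the load address. The function name and the use of `None` for a not-yet-computed address are assumptions.

```python
from typing import Optional, Sequence


def load_may_issue(load_addr: int, older_store_addrs: Sequence[Optional[int]]) -> bool:
    """Return True if the load can safely execute before the older stores commit.

    `None` marks an older store whose address is not yet known (assumed
    encoding); such a store could alias the load, so the load must wait.
    """
    for addr in older_store_addrs:
        if addr is None or addr == load_addr:
            return False
    return True
```

With this policy, a load to address 8 may pass older stores to addresses 4 and 12, but it waits if any older store targets address 8 or has an unresolved address.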

“…Such functionality is most often implemented as a load-store queue (LSQ). Most LSQs aimed at HLS use a content-addressable memory (CAM) structure to implement the load and store queue [20], [23], with a similar operating principle as LSQs used in out-of-order CPUs [24]. CAMs map poorly to FPGA technology resulting in a high critical path and resource usage overhead [20], [25].…”

Section: Runtime Memory Disambiguation In HLS (mentioning)

confidence: 99%

“…Most LSQs aimed at HLS use a content-addressable memory (CAM) structure to implement the load and store queue [20], [23], with a similar operating principle as LSQs used in out-of-order CPUs [24]. CAMs map poorly to FPGA technology resulting in a high critical path and resource usage overhead [20], [25]. Our LSQ design is fundamentally different from previous LSQs in that we use shift registers instead of CAMs.…”

Section: Runtime Memory Disambiguation In HLS (mentioning)

confidence: 99%


A High-Frequency Load-Store Queue with Speculative Allocations for High-Level Synthesis

Szafarczyk, Nabi, Vanderbauwhede

2023

2023 International Conference on Field Programmable Technology (ICFPT)


Dynamically scheduled high-level synthesis (HLS) achieves a higher throughput on codes with unpredictable memory accesses compared to statically scheduled HLS. However, the increased throughput comes at the price of increased resource usage and critical path length, resulting in lower clock frequency. The decrease in clock frequency can be significant, often nullifying any throughput improvements over static scheduling. Recent work presented methods for combining static and dynamic scheduling to achieve high throughput circuits with a fast critical path for dynamic codes. However, circuits that require dynamically scheduled memory still suffer from a decreased frequency. This paper fills this gap by presenting a method for achieving dynamically scheduled memory operations in HLS with a high frequency. Dynamic scheduling of memory operations is realized with a load-store queue (LSQ). We present a novel LSQ design adapted to the nature of spatial architectures with aggressive specialization to the target code - a unique opportunity in HLS. Our LSQ design works for both on-chip and off-chip memory and is integrated with a compiler that combines dynamic and static scheduling. We show a method to speculatively allocate addresses to our LSQ, significantly increasing pipeline parallelism in codes that could not benefit from an LSQ before. In stark contrast to traditional load value speculation, our approach adds no overhead on misspeculation. On a set of ten benchmarks, we show that our approach can achieve an up to 10× speedup on average against static HLS, and an up to 4× speedup against dynamic HLS that uses an LSQ from previous work, while also using several times fewer resources and scaling to larger queues.
