Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Pellauer, Michael; Parashar, Angshuman; Adler, Michael; Ahsan, Bushra; Allmon, R.; Crago, Neal; Fleming, Kermin; Gambhir, Mohit; Jaleel, Aamer; Krishna, Tushar; Lustig, Daniel; Maresh, Stephen; Pavlov, V. V.; Rayess, Rachid; Zhai, Antonia; Emer, Joel

doi:10.1145/2754930

Cited by 14 publications

(9 citation statements)

References 24 publications


“…Spatial or dataflow computing distributes the control and eliminates the requirement and expectation of static reasoning: each operation executes as soon as all of its inputs arrive, and physical operators pass along control "tokens" and the data they produce [16]. Different authors exploited latency-insensitive protocols to construct dynamic, high-performance circuits.…”

Section: Related Work (mentioning)

confidence: 99%
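
The firing rule in the excerpt above (an operation executes as soon as all of its inputs have arrived, then passes its result downstream as a token) can be illustrated with a small software sketch. It is not taken from either paper: the `Node` class, the `run` scheduler, and the example graph `(2 + 3) * 4` are hypothetical names chosen for illustration, and the sequential polling loop only stands in for what spatial hardware would do concurrently.

```python
from collections import deque


class Node:
    """Hypothetical dataflow operator: fires once every input port holds a token."""

    def __init__(self, name, fn, n_inputs):
        self.name = name
        self.fn = fn
        self.inputs = [deque() for _ in range(n_inputs)]  # one token queue per port
        self.consumers = []  # (node, port) pairs that receive this node's result

    def connect(self, consumer, port):
        self.consumers.append((consumer, port))

    def ready(self):
        # An empty deque is falsy, so this is True only if every port has a token.
        return bool(self.inputs) and all(self.inputs)

    def fire(self):
        args = [q.popleft() for q in self.inputs]  # consume one token per port
        result = self.fn(*args)
        for node, port in self.consumers:  # pass the result on as a new token
            node.inputs[port].append(result)
        return result


def run(nodes):
    """Fire any ready node until none is ready; return the firing order."""
    log = []
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node.ready():
                log.append((node.name, node.fire()))
                progress = True
    return log


# Example graph: (2 + 3) * 4. The multiplier is listed first and already holds
# one operand, but it cannot fire until the adder's token arrives.
add = Node("add", lambda a, b: a + b, 2)
mul = Node("mul", lambda a, b: a * b, 2)
add.connect(mul, 0)
mul.inputs[1].append(4)
add.inputs[0].append(2)
add.inputs[1].append(3)
print(run([mul, add]))
```

The firing order is decided by token arrival rather than by the order in which the nodes are listed, which is the sense in which control is distributed in this model.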

An Out-of-Order Load-Store Queue for Spatial Computing

Josipović

Brisk

Ienne

2017

ACM Trans. Embed. Comput. Syst.


The efficiency of spatial computing depends on the ability to achieve maximal parallelism. This necessitates memory interfaces that can correctly handle memory accesses that arrive in arbitrary order while still respecting data dependencies and ensuring appropriate ordering for semantic correctness. However, a typical memory interface for out-of-order processors (i.e., a load-store queue) cannot immediately meet these requirements: a different allocation policy is needed to achieve out-of-order execution in spatial systems that naturally omit the notion of sequential program order, a fundamental piece of information for correct execution. We show a novel and practical way to organize the allocation for an out-of-order load-store queue for spatial computing. The main idea is to dynamically allocate groups of memory accesses (depending on the dynamic behavior of the application), where the access order within the group is statically predetermined (for instance by a high-level synthesis tool). We detail the construction of our load-store queue and demonstrate on a few practical cases its advantages over standard accelerator-memory interfaces.


“…Spatial computation is a paradigm that breaking the application's dataflow into regions, and these regions are mapped to some subset of the hardware resources, including functional units, interconnection network and storage, in the form of producer-consumer pipeline [7]. Spatial computation has some similarity with the multicore models.…”

Section: Spatial Computation (mentioning)

confidence: 99%

“…In Fig. 2(a)(c), it shows the patterns of a program, totally parallelizable and loop-data-dependent, in which the circles with X y represents instruction X of iteration Y, and the squares with x y represents data read by instruction X Y [7]. Fig.…”

Section: Processing Element Array Processing Element (mentioning)

confidence: 99%


The Advantages and Challenges of Spatial Computation

2017

Proceedings of 2017 the 7th International Workshop on Computer Science and Engineering


Recent years witness great interest to extract more performance and energy efficiency from spatial computation which gains great reputation on various domains, ranging from signal processing to high-performance computation. Its success does not only rely on the performance boosting, but also energy efficiency which is a severe problem for the popular portable and wearable devices. Spatial computation has gradually become a part of the mainstream computing infrastructure. This paper offers an observation on the issues of processors, reviews some recent researches, discusses the advantages of spatial computation on flexibility, utilization, scalability and energy efficiency, as well as the challenges from programmability, compatibility, locality, etc.


“…For example, the convolution kernel (that executes for the majority of execution time in ResNeXt [1]) is a 7-deep perfectly nested loop. Variations of dataflow accelerators, like systolic arrays (e.g., Tensor Processing Unit), coarse-grained reconfigurable arrays (CGRAs), and spatial architectures are repeatedly being demonstrated as a promising accelerator for these power and performance-critical loops [2][3][4][5][6][7][8][9][10]. As shown in Figure 1, dataflow accelerators, in general, comprise an array of processing elements aka PEs (where PEs are function-units with little local control) and noncoherent scratchpad memories (SPM) that allow concurrent execution and explicit data management.…”

Section: Introduction (mentioning)

confidence: 99%

dMazeRunner

Dave

Kim

Avancha

et al. 2019

ACM Trans. Embed. Comput. Syst.


Dataflow accelerators feature simplicity, programmability, and energy-efficiency and are visualized as a promising architecture for accelerating perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, how to discover the most efficient way to execute the perfectly nested loop of an application onto computational and memory resources of a given dataflow accelerator (execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner -- to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of dMazeRunner framework is in: i) a holistic representation of the loop nests, that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and the solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time, as compared to prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP, as compared to the optimal solution.

