The SARC Architecture

Ramírez, Alex; Cabarcas, Felipe; Juurlink, Ben; Mesa, Mauricio Alvarez; Sánchez, Friman; Azevedo, Arnaldo; Meenderinck, Cor; Ciobanu, Catalin Bogdan; Isaza, Sebastian; Gaydadjiev, Georgi

doi:10.1109/mm.2010.79

Cited by 47 publications

(43 citation statements)

References 13 publications


“…The PRF registers are multidimensional, with arbitrary sizes and can be created / resized at runtime. Previous studies ([4], [14]) have demonstrated that PRFs suit computationally intensive workloads such as Floyd, the Conjugate Gradient (CG) method and dense matrix multiplication. Moreover, PRFs could improve performance and efficiency in state of the art many-core computers, potentially saving area and power as shown in [5].…”

Section: Introduction (mentioning)

confidence: 99%

“…Furthermore, PRFs allow performance benefits when compared to the Cell processor for Floyd and the main kernel of the CG Method - sparse matrix vector multiplication [4]. The PRF programming interface allows high performance dense matrix multiplication with at least 35 times less instructions than a hand-crafted version for the Cell BE [14]. One of the objectives of the PRF, as part of the Scalable ARChitecture (SARC) project [14], is multi-core scalability.…”

Section: Introduction (mentioning)

confidence: 99%

“…The PRF programming interface allows high performance dense matrix multiplication with at least 35 times less instructions than a hand-crafted version for the Cell BE [14]. One of the objectives of the PRF, as part of the Scalable ARChitecture (SARC) project [14], is multi-core scalability. A CG case study evaluated the PRF based system scalability in a heterogeneous multi-core architecture and showed CG acceleration by two orders of magnitude when using up to 256 PRF instances, with 32 vector lanes each.…”

Section: Introduction (mentioning)

confidence: 99%


Scalability Study of Polymorphic Register Files

Ciobanu

Kuzmanov

Gaydadjiev

2012

2012 15th Euromicro Conference on Digital System Design

Self Cite


Abstract: We study the scalability of multi-lane 2D Polymorphic Register Files (PRFs) in terms of clock cycle time, chip area and power consumption. We assume an implementation which stores data in a 2D array of linearly addressable memory banks, and consider one single-view and four suitable multi-view parallel access schemes which cover all basic access patterns commonly used in scientific and multimedia applications. The PRF design features 2 read ports and 1 write port, targeting the TSMC 90nm ASIC technology. We consider three PRF sizes (32KB, 128KB and 512KB) and four multi-lane configurations (8, 16, 32 and 64 lanes). Synthesis results suggest that the clock frequency varies between 500MHz for a 512KB PRF with 64 vector lanes and 970MHz for the 32KB, 8-lane case. Estimated power consumption ranges from less than 300mW (dynamic) and 10mW (leakage) for our 8-lane, 32KB PRF up to 8.7W (dynamic) and 276mW (leakage) for a 512KB PRF with 64 lanes. We also show the correlation among the storage capacity, the number of lanes, and the overall chip area. Furthermore, we investigated customized addressing functions. Our experimental results suggest up to a 21% increase of the clock frequency, and up to a 39% reduction of combinational hardware area (nearly 10% of the total area) compared to our straightforward implementations. Concerning power, we reduce dynamic power by up to 31% and leakage by nearly 24%.
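To make the idea of a 2D array of linearly addressable memory banks more concrete, here is a minimal sketch of one possible single-view mapping. It is an illustration only: the modulo mapping, the function names and the 2 x 4 bank arrangement are assumptions chosen for the example, not the addressing functions evaluated in the cited paper. It shows why any p x q rectangular block can be read in parallel without bank conflicts, while a long row cannot, which is the kind of limitation that multi-view schemes are meant to address.

```python
from itertools import product


def bank_of(i, j, p, q):
    """Bank coordinates for logical element (i, j) in a p x q bank array.

    Assumed mapping for illustration: row index mod p, column index mod q.
    """
    return (i % p, j % q)


def intra_bank_address(i, j, p, q, cols):
    """Linear address inside the selected bank.

    Assumes the logical array has `cols` columns and that cols is a
    multiple of q, so each bank holds cols // q elements per stored row.
    """
    return (i // p) * (cols // q) + (j // q)


def is_conflict_free(cells, p, q):
    """True if every accessed cell falls in a different bank."""
    banks = [bank_of(i, j, p, q) for i, j in cells]
    return len(set(banks)) == len(banks)


def rectangle(i0, j0, p, q):
    """Cells of a p x q block whose top-left corner is (i0, j0)."""
    return [(i0 + di, j0 + dj) for di, dj in product(range(p), range(q))]


def row(i0, j0, n):
    """Cells of n consecutive elements in row i0, starting at column j0."""
    return [(i0, j0 + k) for k in range(n)]


if __name__ == "__main__":
    p, q = 2, 4  # hypothetical 8-lane configuration
    print(is_conflict_free(rectangle(3, 5, p, q), p, q))  # unaligned block
    print(is_conflict_free(row(0, 0, p * q), p, q))       # 8-element row
```

Under this assumed mapping, an unaligned 2 x 4 block touches all eight banks exactly once, whereas an 8-element row reuses the four banks of a single bank row, so it would need two access cycles.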


“…Alternatively, the address translation mechanism can be augmented with a few extra bits that explicitly determine whether an address region contains cacheable or directly-addressed (scratchpad) data, as shown in Figure 1. This is important when remote scratchpad regions are addressed, so that the hardware accesses them remotely, rather than locally caching them.…”

Section: Memory Access Semantics: Cache Scratchpad Communication (mentioning)

confidence: 99%
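The citation statement above describes translation entries carrying extra bits that mark a region as cacheable or scratchpad. A minimal sketch of how such a marker could steer an access is given below. All names (`RegionKind`, `Translation`, `route_access`), the 4 KB page size and the `home_node` field are hypothetical and chosen only for illustration; they are not taken from the SARC design.

```python
from dataclasses import dataclass
from enum import Enum

PAGE_BITS = 12  # assumed 4 KB pages, for illustration only


class RegionKind(Enum):
    CACHEABLE = 0
    SCRATCHPAD = 1


@dataclass(frozen=True)
class Translation:
    physical_page: int
    kind: RegionKind   # the "extra bits" attached to the translation
    home_node: int     # hypothetical: node that owns a scratchpad region


def route_access(vaddr, page_table, local_node):
    """Translate vaddr and decide which hardware path serves the access."""
    entry = page_table[vaddr >> PAGE_BITS]
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    paddr = (entry.physical_page << PAGE_BITS) | offset
    if entry.kind is RegionKind.CACHEABLE:
        return ("cache", paddr)
    if entry.home_node == local_node:
        return ("local_scratchpad", paddr)
    # Remote scratchpad data is accessed remotely, never cached locally.
    return ("remote_scratchpad", paddr)


if __name__ == "__main__":
    table = {
        0: Translation(physical_page=5, kind=RegionKind.CACHEABLE, home_node=0),
        1: Translation(physical_page=7, kind=RegionKind.SCRATCHPAD, home_node=3),
    }
    print(route_access(0x0010, table, local_node=0))
    print(route_access(0x1004, table, local_node=0))
```

The point of the sketch is only the decision order: the region kind is known at translation time, so a remote scratchpad address can be sent over the network instead of being allocated in the local cache.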

Explicit Communication and Synchronization in SARC

et al. 2010


SARC merges cache controller and network interface functions by relying on a single hardware primitive: each access checks the tag and the state of the addressed line for possible occurrence of events that may trigger responses like coherence actions, RDMA, synchronization, or configurable event notifications. The fully virtualized and protected user-level API is based on specially marked lines in the scratchpad space that respond as command buffers, counters, or queues. The runtime system maps communication abstractions of the programming model to data transfers among local memories using remote write or read DMA and into task synchronization and scheduling using notifications, counters, and queues. The on-chip network provides efficient communication among these configurable memories, using advanced topologies and routing algorithms, and providing for process variability in NoC links. We simulate benchmark kernels on a full-system simulator to compare speedup and network traffic against cache-only systems with directory-based coherence and prefetchers. Explicit communication provides 10 to 40% higher speedup on 64 cores, and reduces network traffic by factors of 2 to 4, thus economizing on energy and power; lock and barrier latency is reduced by factors of 3 to 5.

EXPLICIT COMMUNICATION AND NETWORK INTERFACE EVOLUTION

Interprocessor communication (IPC) is the basis of parallel processing. IPC can be implicit, when the addresses supplied by the software do not identify physical data locations or (time of) movement, or it can be explicit, when software (the application, or compiler, or runtime system) is able to also indicate physical placement or transfers, besides specifying computation on data.

The SARC architecture [1] supports both implicit IPC, through cache coherence, for ease of programming, and explicit IPC, through scratchpad memories and remote store instructions or remote DMA operations, to be used by software whenever possible for achieving scalable performance.

In order to hide IPC latency, when using implicit communication, we need large issue windows in out-of-order-execution processors, or sophisticated data prefetchers, or both. Explicit communication has the potential to better hide IPC latency, in those cases when software knows better than hardware what transfers need to take place and when. Remote store instructions, to addresses that indicate proximity to the consumer, when that is known at production time, will transfer data at the earliest possible time; hardware should coalesce writes to adjacent targets into few network packets, and the processor should not wait for the arrival acknowledgments. Remote direct memory access (RDMA) is the other method for explicit communication, in cases that require either reads (when the consumer is unknown or unavailable at production time) or multi-word writes (to achieve good coalescence).

Traditional systems viewed networks as external (slow) devices, provided DMA in the network interface (NI), and interacted with it through (slow) input/output (I...
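The text above notes that hardware should coalesce remote writes to adjacent targets into few network packets. The following sketch models that idea in software under stated assumptions: stores are (word address, value) pairs, a later store to the same address overwrites an earlier one, and a packet carries at most `max_words_per_packet` contiguous words. The function name, the packet layout and the limit of 8 words are hypothetical, not details of the SARC hardware.

```python
def coalesce_remote_stores(stores, max_words_per_packet=8):
    """Group remote stores into packets of contiguous word addresses.

    stores: iterable of (word_address, value) pairs in program order.
    Returns a list of packets, each a dict with a base address and the
    words to write starting at that address.
    """
    # Keep only the last value written to each address (assumed semantics).
    latest = {}
    for addr, value in stores:
        latest[addr] = value

    packets = []
    for addr, value in sorted(latest.items()):
        last = packets[-1] if packets else None
        extends_last = (
            last is not None
            and addr == last["base"] + len(last["words"])
            and len(last["words"]) < max_words_per_packet
        )
        if extends_last:
            last["words"].append(value)
        else:
            packets.append({"base": addr, "words": [value]})
    return packets


if __name__ == "__main__":
    pending = [(100, "a"), (101, "b"), (102, "c"), (200, "d")]
    for packet in coalesce_remote_stores(pending):
        print(packet)
```

With the four example stores, three adjacent words travel in one packet and the isolated store in another, so two packets replace four single-word messages.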


“…In this work, we analyze the performance of such accelerators in a heterogeneous multicore processor with specialized workers -the SARC architecture [16]. Moreover, we consider critical parameters such as the available memory bandwidth and the memory latency.…”

Section: Introduction (mentioning)

confidence: 99%

Scalability Evaluation of a Polymorphic Register File: A CG Case Study

Ciobanu

Martorell

Kuzmanov

et al. 2011

Architecture of Computing Systems - ARCS 2011

Self Cite


Abstract. We evaluate the scalability of a Polymorphic Register File using the Conjugate Gradient method as a case study. We focus on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency. We compare the performance of 256 Polymorphic Register File-augmented workers against a single Cell PowerPC Processor Unit (PPU). In such a scenario, simulation results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained. Moreover, when an equal number of workers in the range 1-256 is employed, our design is between 1.7 and 4.2 times faster than a Cell PPU-based system. Furthermore, we study the impact of memory latency and cache bandwidth on the sustainable speedups of the system considered. Our tests suggest that a 128-worker configuration requires the caches to deliver 1638.4 GB/sec in order to preserve 80% of its peak speedup.
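As a quick reading aid for the bandwidth figure in this abstract, the short calculation below divides the quoted aggregate cache bandwidth by the number of workers. It assumes the bandwidth is shared evenly among workers, which the abstract does not state; the function name is hypothetical.

```python
def per_worker_bandwidth(total_gb_per_s, workers):
    """Average bandwidth per worker, assuming an even split (an assumption)."""
    return total_gb_per_s / workers


if __name__ == "__main__":
    # Figures quoted in the abstract: 1638.4 GB/sec for 128 workers.
    print(per_worker_bandwidth(1638.4, 128))
```

Under the even-split assumption, this works out to 12.8 GB/sec per worker.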
