An OS-based alternative to full hardware coherence on tiled CMPs

Fensch, Christian; Cintra, Marcelo

doi:10.1109/hpca.2008.4658652

Cited by 40 publications

(40 citation statements)

References 36 publications


“…Under the remote-access framework of standard NUCA designs [7,9], all non-local memory accesses cause a request to be transmitted over the interconnect, the access to be performed in the remote core, and the data (for loads) or acknowledgement (for writes) to be sent back to the requesting core. When a core C executes a memory access for address A, it must first find the home core H for A (e.g., by consulting a mapping table or masking some address bits).…”

Section: A Remote Cache Access (mentioning)

confidence: 99%
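The home-core lookup mentioned in the statement above ("consulting a mapping table or masking some address bits") can be sketched as follows. This is only an illustration: the constants `LINE_BITS`, `NUM_CORES`, and `PAGE_BITS` and both function names are assumptions chosen for the example, not values or interfaces taken from the cited papers.

```python
# Hypothetical parameters, chosen only for illustration.
LINE_BITS = 6    # assumed 64-byte cache lines
NUM_CORES = 16   # assumed power-of-two number of tiles
PAGE_BITS = 12   # assumed 4 KiB pages for the table-based variant


def home_core_by_mask(addr):
    """Pick the home core from the address bits just above the line offset."""
    return (addr >> LINE_BITS) & (NUM_CORES - 1)


def home_core_by_table(addr, page_table):
    """Look up the home core in a page-granularity mapping table.

    page_table maps a page number to a core id; a table like this is one
    way an OS could control data placement.
    """
    return page_table[addr >> PAGE_BITS]
```

With the masking variant, consecutive cache lines are interleaved across cores; the table variant leaves placement under software control at page granularity.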

“…A straightforward approach to removing directories while maintaining cache coherence is to disallow cache line replication across on-chip caches (even L1 caches) and use remote word-level access to load and store remotely cached data [7]: in this scheme, every access to an address cached on a remote core becomes a two-message round trip. Since only one copy is ever cached, however, coherence is trivially ensured.…”

Section: Introduction (mentioning)

confidence: 99%
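The no-replication scheme described in the statement above can be modelled with a toy simulation. This is a minimal sketch under stated assumptions: the class `SingleCopyMemory`, its modulo placement rule, and the message counter are hypothetical and are not drawn from the cited design.

```python
class SingleCopyMemory:
    """Toy model: each word is cached only at its home core."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.caches = [dict() for _ in range(num_cores)]
        self.messages = 0  # interconnect messages sent so far

    def home(self, addr):
        # Assumed placement rule, for illustration only.
        return addr % self.num_cores

    def _round_trip(self, core, home):
        # A remote access costs one request plus one reply or acknowledgement.
        if core != home:
            self.messages += 2

    def store(self, core, addr, value):
        h = self.home(addr)
        self._round_trip(core, h)
        self.caches[h][addr] = value

    def load(self, core, addr):
        h = self.home(addr)
        self._round_trip(core, h)
        return self.caches[h].get(addr, 0)
```

Because every load and store for an address is served by the same cache, a load from any core observes the most recent store without any invalidation or directory state. The price is the two-message round trip on each non-local access.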


The Execution Migration Machine: Directoryless Shared-Memory Architecture

et al. 2015


Distributed directory cache coherence protocols for current many-core CMPs are not only difficult and error-prone to implement and verify, but also provide suboptimal performance when a thread requires access to large amounts of data distributed across the chip: the data must be brought to the core where the thread is running, incurring delays and energy costs. In this paper, we propose an approach based on the combination of partial-context thread migration and a directory-free remote access protocol: for these kinds of applications, our architecture can outperform directory-based cache coherence. In addition, unlike with distributed cache coherence protocols, the verification complexity of our architecture does not grow with the number of cores.



“…NUCA architectures divide the address space among the cores in such a way that each address is assigned to a unique home core where the corresponding data can be cached [7], [5]. To read and write data cached in a remote core, the NUCA architectures proposed so far use a remote access mechanism where a request is sent to the home core and the resulting data (or acknowledgement) is sent back to the requesting core.…”

Section: Memory Access Framework (mentioning)

confidence: 99%

“…Under the remote-access framework of standard NUCA designs [7], [5], all non-local memory accesses cause a request to be transmitted over the interconnect network, the access to be performed in the remote core, and the data (for loads) or acknowledgement (for writes) to be sent back to the requesting core. When a core C executes a memory access for address A, it must first find the home core H for A (e.g., by consulting a mapping table or masking some address bits).…”

Section: Remote Cache Access (mentioning)

confidence: 99%

“…For such massive multicores, a tiled architecture where each core has its own cache slice has become a popular design. These physically distributed cache slices can form one logically shared cache, known as Non-Uniform Cache Access (NUCA) architecture [7], [5]. In the "pure" form of NUCA where per-core caches are fully shared, each cache line corresponds to a unique core where it can be kept on chip, which maximizes effective on-chip cache capacity and reducing off-chip access rates.…”

Section: Introduction (mentioning)

confidence: 99%


Thread Migration Prediction for Distributed Shared Caches

Shim, Lis, Khan, et al. 2014

IEEE Comput. Arch. Lett.


Chip-multiprocessors (CMPs) have become the mainstream parallel architecture in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Access (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware-level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides whether to perform a remote access (as in traditional NUCA designs) or to perform a thread migration at the instruction level. For a set of parallel benchmarks, our thread migration predictor improves the performance by 24% on average over the shared-NUCA design that only uses remote accesses.
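The trade-off this abstract describes, several round-trip remote accesses versus one costly migration, can be illustrated with a simple break-even rule. This is not the paper's predictor: the function `should_migrate` and its cost parameters are hypothetical, and the rule ignores the cost of any later migration back to the original core.

```python
def should_migrate(expected_accesses, round_trip_cost, migration_cost):
    """Return True when migrating is cheaper than repeated remote accesses.

    expected_accesses: predicted consecutive accesses to the same home core
    round_trip_cost:   assumed cost (e.g. cycles) of one remote access
    migration_cost:    assumed one-way cost of moving the thread context
    """
    return expected_accesses * round_trip_cost > migration_cost
```

For example, with an assumed round-trip cost of 20 and a migration cost of 100, a single access favours a remote access, while a run of ten accesses to the same home core favours migration.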


A real‐time capable coherent data cache for multicores

Pyka, Rohde, Uhrig 2013

Concurrency and Computation


In multicore systems, the concurrent access to shared data generates a bottleneck for the system performance. Cache coherence techniques have been introduced to enable fast access while preserving the data coherence, but these coherence protocols are critical in hard real-time systems. Because the frequent inter-cache communication leads to unpredictable interferences between the cores, the system's timing behaviour is hard to analyse. In this paper, we propose a new, hard real-time capable strategy for multicore systems called on-demand coherent cache ODC². The technique is based on marginal hardware extensions compared with noncoherent caches and the use of common synchronisation techniques. ODC² provides coherent accesses to cached shared data as well as caching of private data. Because the presented strategy does not induce interferences between local caches, ODC² is capable for hard real-time systems. We present an evaluation of performance and scalability of ODC² compared with two standard coherence protocols using a bus-based multicore system.
