Thread Migration Prediction for Distributed Shared Caches

Shim, Keun Sup; Lis, Mieszko; Khan, Omer; Devadas, Srinivas

doi:10.1109/l-ca.2012.30

Cited by 12 publications

(10 citation statements)

References 13 publications

“…Thread migration may be achieved at the user level, kernel level, or application level. (Shim, Lis, Khan, & Devadas, 2014) considered a mechanism, hardware-level thread migration. They argued that the method has the capability to better exploit shared data locality for NUCA (Non-Uniform Cache Architecture) designs by adequately supplanting multiple round-trip remote cache accesses by fewer migrations.…”

Section: Thread Migration (mentioning)

confidence: 99%
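
The trade-off named in the statement above (several round-trip remote cache accesses versus fewer migrations) can be illustrated with a toy cost model. This is a minimal sketch, not the cost model of the cited paper: the function names, the unit hop latency, and the assumption that a migration costs the hop distance plus one cycle per context flit are all hypothetical.

```python
def remote_access_cost(n_accesses: int, hops: int, hop_latency: int = 1) -> int:
    """Cycles for n round-trip remote cache accesses (request + reply each).

    Hypothetical model: every access crosses `hops` links in both directions.
    """
    return n_accesses * 2 * hops * hop_latency


def migration_cost(hops: int, context_flits: int, hop_latency: int = 1,
                   round_trip: bool = True) -> int:
    """Cycles to move a thread context to the remote core (and back).

    Hypothetical model: one-way cost is the hop distance plus one cycle
    per flit of context serialized onto the network.
    """
    one_way = hops * hop_latency + context_flits
    return 2 * one_way if round_trip else one_way


def prefer_migration(n_accesses: int, hops: int, context_flits: int,
                     hop_latency: int = 1) -> bool:
    """True when migrating there and back is cheaper than n remote accesses."""
    return (migration_cost(hops, context_flits, hop_latency)
            < remote_access_cost(n_accesses, hops, hop_latency))


if __name__ == "__main__":
    # A core 4 hops away, with a context that takes 8 flits to transfer.
    for n in (1, 2, 4, 8):
        print(n, prefer_migration(n, hops=4, context_flits=8))
```

Under these assumed numbers, migration only pays off once a thread makes enough consecutive accesses to the same remote core, which is the kind of pattern a migration predictor would try to detect.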

A State of Art Survey for OS Performance Improvement

Haji, Zeebaree, Jacksi, et al. 2018

SJUOZ

Through the huge growth of heavy computing applications which require a high level of performance, it is observed that the interest of monitoring operating system performance has also demanded to be grown widely. In the past several years since OS performance has become a critical issue, many research studies have been produced to investigate and evaluate the stability status of OSs performance. This paper presents a survey of the most important and state of the art approaches and models to be used for performance measurement and evaluation. Furthermore, the research marks the capabilities of the performance-improvement of different operating systems using multiple metrics. The selection of metrics which will be used for monitoring the performance depends on monitoring goals and performance requirements. Many previous works related to this subject have been addressed, explained in details, and compared to highlight the top important features that will very beneficial to be depended for the best approach selection.

“…Therefore, our migration predictor focuses on detecting those. Compared to the predictor presented in [16], which only supports full-context migration, we further reduce migration costs by sending only a part of the register file when a thread migrates (usually, only some of the registers are used between the time the thread migrates out of its native core and the time it returns). With the deadlock-free migration framework of [5], the native-core register file remains intact even if a thread migrates away, because its context is not used by any other guest threads.…”

Section: Thread Migration Predictor (mentioning)

confidence: 99%

“…Thread migration has also been used to provide memory coherence among per-core caches [10] using a deadlock-free fine-grained thread migration protocol [5]; we adopt the same protocol for our hybrid framework. Although a migration predictor that decides between migrations and remote accesses is introduced in [16], it does not address the overhead of high network traffic for thread migration. This paper proposes a novel migration predictor that supports partial context migration, improving both performance and network traffic.…”

Section: Related Work (mentioning)

confidence: 99%

The Execution Migration Machine: Directoryless Shared-Memory Architecture

et al. 2015

Self Cite

Distributed directory cache coherence protocols for current many-core CMPs are not only difficult and error-prone to implement and verify, but also provide suboptimal performance when a thread requires access to large amounts of data distributed across the chip: the data must be brought to the core where the thread is running, incurring delays and energy costs. In this paper, we propose an approach based on the combination of partial-context thread migration and a directory-free remote access protocol: for these kinds of applications, our architecture can outperform directory-based cache coherence. In addition, unlike with distributed cache coherence protocols, the verification complexity of our architecture does not grow with the number of cores.

“…Although our ISA allows the programmer to directly specify whether the instruction should migrate or execute via remote cache access, in general this decision can be dynamic and dependent on the phase of the program; therefore, EM² relies on an automatic hardware migration predictor [19] in each tile.…”

Section: E. Migration Decision Scheme (mentioning)

confidence: 99%

“…While the predictor described in [19] only supports full-context migration, the migration predictor of EM² further supports stack-based partial context migration. Each predictor entry consists of a tag for the PC and the transfer sizes for the main and auxiliary stacks upon migrating a thread.…”

Section: E. Migration Decision Scheme (mentioning)

confidence: 99%
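
The predictor entry described in the statement above (a PC tag plus transfer sizes for the main and auxiliary stacks) might be modeled roughly as follows. This is a sketch under stated assumptions, not the EM² hardware design: the direct-mapped organization, the table size, the use of the full PC as the tag, the miss policy of falling back to remote access, and every class and method name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class PredictorEntry:
    pc_tag: int           # tag for the PC of the memory instruction
    main_stack_size: int  # main-stack entries to send when migrating
    aux_stack_size: int   # auxiliary-stack entries to send when migrating


class MigrationPredictor:
    """Toy per-tile predictor: a PC hit means 'migrate', a miss means 'remote access'."""

    def __init__(self, num_entries: int = 32) -> None:
        self.num_entries = num_entries              # assumed table size
        self.table: Dict[int, PredictorEntry] = {}  # index -> entry

    def _index(self, pc: int) -> int:
        # Assumed direct-mapped indexing on the low bits of the PC.
        return pc % self.num_entries

    def lookup(self, pc: int) -> Optional[PredictorEntry]:
        entry = self.table.get(self._index(pc))
        if entry is not None and entry.pc_tag == pc:
            return entry
        return None

    def insert(self, pc: int, main_stack_size: int, aux_stack_size: int) -> None:
        # Overwrites whatever entry currently occupies this index.
        self.table[self._index(pc)] = PredictorEntry(pc, main_stack_size, aux_stack_size)

    def decide(self, pc: int) -> Tuple[str, int, int]:
        """Return (action, main_stack_size, aux_stack_size) for this PC."""
        entry = self.lookup(pc)
        if entry is None:
            return ("remote_access", 0, 0)
        return ("migrate", entry.main_stack_size, entry.aux_stack_size)


if __name__ == "__main__":
    predictor = MigrationPredictor(num_entries=4)
    predictor.insert(0x10, main_stack_size=2, aux_stack_size=1)
    print(predictor.decide(0x10))  # PC present in the table
    print(predictor.decide(0x14))  # same index, different tag
```

Storing the transfer sizes alongside the tag is what lets a hit answer two questions at once: whether to migrate, and how much of the context to carry.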

Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine

Shim, Lis, Cho, et al. 2013

2013 IEEE 31st International Conference on Computer Design (ICCD)

Self Cite

Abstract: As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM²), establishes a new design point. On the one hand, EM² supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM² chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM² is a fairly large design (110 cores using a total of 357 million transistors), the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months.
