2014
DOI: 10.1109/tc.2013.161
A Practical Data Classification Framework for Scalable and High Performance Chip-Multiprocessors

Abstract: State-of-the-art chip multiprocessor (CMP) proposals emphasize general optimizations designed to deliver computing power for many types of applications. Potentially significant performance improvements that leverage application-specific characteristics, such as data access behavior, are missed by this approach. In this paper, we demonstrate how scalable and high-performance parallel systems can be built by classifying data accesses into different categories and treating them differently. We develop a novel comp…

Cited by 5 publications (5 citation statements) · References 33 publications
“…Like our study, some prior studies in the literature classify data blocks as private or shared for different purposes in CMPs, such as reducing coherence overhead or the access latency to distributed caches. Hardavellas et al. [17] and Li et al. [23] categorize data blocks and keep private blocks in the nonuniform distributed shared cache (nonuniform cache access, NUCA) slice of the requesting core, where access latency depends on the physical distance between the core demanding the data and the L2 cache slice storing it. The primary aim of these two studies is to reduce NUCA access latency through intelligent placement, migration, and replication mechanisms.…”
Section: Related Work and Motivation
confidence: 99%
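The private-versus-shared classification these studies build on can be sketched in a few lines. The following is a minimal, hypothetical model (not the actual mechanism of [17] or [23], and all names are assumptions): a block remains private while only one core has touched it and is placed in that core's local NUCA slice; once a second core accesses it, it is reclassified as shared and placed by address interleaving across all slices.

```python
# Hypothetical sketch of private/shared block classification with
# NUCA-aware placement. Not the cited papers' implementation.

class BlockClassifier:
    def __init__(self, num_slices):
        self.num_slices = num_slices
        self.owner = {}      # block address -> sole accessing core so far
        self.shared = set()  # blocks touched by more than one core

    def access(self, block_addr, core_id):
        """Record an access; return the NUCA slice the block should live in."""
        if block_addr in self.shared:
            # Shared blocks use address-interleaved placement.
            return block_addr % self.num_slices
        prev = self.owner.setdefault(block_addr, core_id)
        if prev != core_id:
            # A second core touched the block: reclassify as shared.
            self.shared.add(block_addr)
            del self.owner[block_addr]
            return block_addr % self.num_slices
        # Private blocks stay in the requesting core's local slice,
        # minimizing the physical distance (NUCA latency) to the data.
        return core_id
```

The transition is one-way: once a block is observed as shared it never reverts to private, which keeps the bookkeeping simple at the cost of some lost opportunity after, e.g., thread migration.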
“…Moreover, the private data detection mechanisms in this paper are quite different from those used in these studies. Whereas [17] classifies cache access patterns via the OS, Li et al. [23] detect private data offline with compiler assistance. As noted in previous studies [3], [10], we believe that more private data blocks can be detected at runtime than offline.…”
Section: Related Work and Motivation
confidence: 99%
“…Some prior studies exploited private data detection to enable high performance in many-core architectures by mitigating the overhead of managing coherence. The detection of private data might be done offline with compiler assistance [15]. Although this approach incurs no runtime overhead and requires no extra hardware, there is a limit on the amount of private data that can be detected statically.…”
Section: Key Observation
confidence: 99%
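The reason runtime detection finds more private data than static analysis is that a compiler must conservatively mark as shared any page that *might* be accessed by two cores, whereas a runtime scheme only demotes a page once a second core actually touches it. A trace-level sketch of that runtime rule, under assumed names (not the scheme of [3], [10], or [15]):

```python
# Hypothetical runtime private-page classifier: a page is private to the
# first core that touches it until a different core accesses it, after
# which it is permanently marked shared. Not any cited paper's code.

def classify_pages(access_trace):
    """access_trace: iterable of (page, core) pairs, in program order.
    Returns {page: 'private' | 'shared'}."""
    first_core = {}  # page -> first core observed touching it
    shared = set()   # pages touched by a second core (sticky)
    for page, core in access_trace:
        if page in shared:
            continue
        owner = first_core.setdefault(page, core)
        if owner != core:
            shared.add(page)
    return {p: ('shared' if p in shared else 'private') for p in first_core}
```

A compiler performing the same classification offline would have to merge all *possible* interleavings, so any page reachable from two threads lands in the shared set; the runtime version only pays for sharing that actually occurs.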
“…Morrigan sets the access bit of all prefetched pages, since the x86 memory consistency model obliges all TLB prefetches to do so [48]. Therefore, Morrigan does not complicate TLB shootdowns [53,57,88,181,288], because the information about the prefetched instruction PTEs is conveyed to the OS as usual. Regarding the impact on the page replacement policy, a prefetch is harmful to that policy if it is evicted from the TLB PB without providing any hit and does not belong to the active footprint of the application.…”
Section: Page Replacement Policy and TLB Shootdowns
confidence: 99%
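The "harmful prefetch" criterion above can be made concrete with a small model of a TLB prefetch buffer (PB) that tracks whether each prefetched entry ever serves a hit before eviction. This is a hedged sketch under assumed names, not Morrigan's design, and it captures only the evicted-without-a-hit half of the definition (the active-footprint check is omitted):

```python
from collections import OrderedDict

# Hypothetical FIFO TLB prefetch buffer that counts prefetches evicted
# without ever providing a hit. Not the actual Morrigan structure.

class PrefetchBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # page -> was_hit flag, in FIFO order
        self.harmful = 0

    def prefetch(self, page):
        if page in self.entries:
            return
        if len(self.entries) >= self.capacity:
            # Evict the oldest entry; if it never served a hit, the
            # prefetch wasted a slot and is counted as harmful.
            _, was_hit = self.entries.popitem(last=False)
            if not was_hit:
                self.harmful += 1
        self.entries[page] = False

    def lookup(self, page):
        if page in self.entries:
            self.entries[page] = True  # the prefetch proved useful
            return True
        return False
```

Counting harmful evictions this way lets a prefetcher throttle itself when wasted prefetches start displacing useful entries.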
“…The PPM module does not introduce new security vulnerabilities, since it solely leverages the page size information that is part of the address translation metadata available after the TLB access. An adversary could not use events such as context switches and TLB shootdowns [53,57,88,181,288] to violate the security guarantees of PPM; this would be possible only if PPM stored the page size information in a data structure that was not flushed upon TLB shootdowns and context switches.…”
Section: Security
confidence: 99%