Using nonvolatile memories in the memory hierarchy has been investigated as a way to reduce energy consumption, because nonvolatile memory cells consume zero leakage power. One difficulty, however, is that the endurance of most nonvolatile memory technologies is much lower than that of conventional SRAM and DRAM technologies. This has limited their use to the low levels of the memory hierarchy, e.g., disks, far from the CPU. In this paper, we study the use of a new type of nonvolatile memory, Phase Change Memory (PCM), as the main memory of a 3D-stacked chip. The main challenges we face are PCM's limited endurance, longer access latency, and higher dynamic power compared with conventional DRAM. We propose techniques to extend PCM endurance to an average of 13 years (for MLC PCM) to 22 years (for SLC PCM). We also study the design choices for implementing PCM to achieve the best tradeoff between energy and performance. Our design reduces the total energy of an already low-power DRAM main memory of the same capacity by 65%, and the energy-delay² product by 60%. These results indicate that it is feasible to use PCM in place of DRAM as main memory for better energy efficiency.
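As a rough sanity check on lifetime figures of this kind (the parameters below are hypothetical, not taken from the paper), an idealized wear-leveled PCM lifetime can be estimated as the total cell writes the device can absorb divided by the sustained write rate:

```python
def pcm_lifetime_years(capacity_bytes, line_bytes, cell_endurance,
                       writes_per_sec):
    """Estimate lifetime assuming ideal wear leveling spreads writes
    evenly over every line in the device."""
    lines = capacity_bytes // line_bytes
    total_writes = lines * cell_endurance    # writes before wear-out
    seconds = total_writes / writes_per_sec
    return seconds / (365 * 24 * 3600)

# Hypothetical example: 4 GB of PCM, 64 B lines, 1e8 writes/cell
# endurance, 10 million line writes per second sustained.
print(f"{pcm_lifetime_years(4 << 30, 64, 1e8, 1e7):.1f} years")
# -> about 21 years
```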
By studying the behavior of programs in the SPECint95 suite, we observed that six of the eight programs exhibit a new kind of value locality, frequent value locality, according to which a few values appear very frequently in memory locations and are therefore involved in a large fraction of memory accesses. In these six programs, ten distinct values occupy over 50% of all memory locations and on average account for nearly 50% of all memory accesses during program execution. This observation also holds for smaller blocks of consecutive memory locations, and the set of frequent values remains quite stable over the execution of the program. In the six benchmarks with frequent value locality, on average 50% of all cache misses occur during the reading or writing of the ten most frequently accessed values. We propose a new data cache structure, the frequent value cache (FVC), which employs a value-centric approach to caching data locations for exploiting the frequent value locality phenomenon. The FVC is a small direct-mapped cache dedicated to holding only frequently occurring values. The value-centric nature of the FVC enables us to store data in compressed form, where compression is achieved by encoding the frequent values using a few bits. Moreover, this simple compression scheme preserves random access to data values within a cache line. Our experiments demonstrate that by augmenting a direct-mapped cache (DMC) with a direct-mapped FVC of size no more than 3 Kbytes, we obtain reductions in miss rates ranging from 1% to 68%. In fact, for the 124.m88ksim and 134.perl benchmarks, we observed that augmenting a DMC with a small FVC yields higher miss-rate reductions than doubling the size of the DMC.
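The abstract gives enough detail to sketch the structure. The following is a minimal, hypothetical rendering (class and field names are ours, not the paper's) of a direct-mapped FVC that stores each word as a 4-bit code into a table of frequent values, reserving one escape code so that random access within a line is preserved:

```python
class FrequentValueCache:
    """Hypothetical sketch of a direct-mapped FVC: each cached word is
    a 4-bit code indexing a table of frequent values; code 15 marks a
    word whose value is not one of the frequent values."""

    ESCAPE = 15

    def __init__(self, num_lines, words_per_line, freq_values):
        assert len(freq_values) <= self.ESCAPE
        self.encode = {v: i for i, v in enumerate(freq_values)}
        self.decode = list(freq_values)
        self.words_per_line = words_per_line
        self.num_lines = num_lines
        self.lines = [None] * num_lines   # each entry: (tag, codes)

    def _index_tag(self, word_addr):
        line = word_addr // self.words_per_line
        return line % self.num_lines, line // self.num_lines

    def read(self, word_addr):
        """Return the word's value, or None on an FVC miss."""
        idx, tag = self._index_tag(word_addr)
        entry = self.lines[idx]
        if entry is None or entry[0] != tag:
            return None                    # line not present
        code = entry[1][word_addr % self.words_per_line]
        return None if code == self.ESCAPE else self.decode[code]

    def fill(self, word_addr, words):
        """Install a line; non-frequent words get the escape code, so
        random access within the line still works."""
        idx, tag = self._index_tag(word_addr)
        codes = [self.encode.get(w, self.ESCAPE) for w in words]
        self.lines[idx] = (tag, codes)
```

With ten frequent values encoded in 4 bits per 32-bit word, each line compresses by roughly 8x, which is consistent with a cache of only a few Kbytes covering far more data than its raw size suggests.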
GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency, but quality of service (QoS) among concurrent applications is minimally supported. Previous efforts are too coarse-grained and do not scale with increasing QoS requirements. We propose QoS mechanisms for a fine-grained form of GPU sharing. Our QoS support can control the progress of kernels on a per-cycle basis and the amount of thread-level parallelism of each kernel. Thanks to this accurate resource management, our QoS support scales significantly better than previous best efforts. Evaluations show that, when the GPU is shared by three kernels, two of which have QoS goals, the proposed techniques achieve QoS goals 43.8% more often than previous techniques and deliver 20.5% higher throughput.
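The paper's exact mechanism is not spelled out in the abstract; as a loose illustration only, per-epoch QoS control of a kernel's thread-level parallelism might follow a hypothetical feedback rule like this:

```python
def adjust_tlp(cur_blocks, measured_progress, qos_goal, max_blocks):
    """Hypothetical per-epoch feedback: give a lagging QoS kernel more
    thread blocks; reclaim blocks from a kernel running ahead so
    best-effort kernels can use the freed resources."""
    if measured_progress < qos_goal and cur_blocks < max_blocks:
        return cur_blocks + 1
    if measured_progress > qos_goal and cur_blocks > 1:
        return cur_blocks - 1
    return cur_blocks
```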
Micro-Controller Units (MCUs) are widely adopted ubiquitous computing devices. Due to tight cost and energy constraints, MCUs often integrate very limited internal RAM on top of Flash storage, which exposes the Flash to heavy write traffic and results in short system lifetime. Architecting emerging Phase Change Memory (PCM) in place of Flash is a promising approach for MCUs due to PCM's fast read speed and long write endurance. However, PCM, especially multi-level cell (MLC) PCM, has long write latency and requires large write energy, which diminishes the benefit of replacing traditional Flash. By studying MLC PCM write operations, we observe that writing MLC PCM can take advantage of two write modes: a fast write leaves cells in a volatile state, while a slow write leaves cells in a non-volatile state. In this paper, we propose a compiler-directed dual-write (CDDW) scheme that selects the best write mode for each write operation to maximize overall performance and energy efficiency. Our experimental results show that CDDW reduces dynamic energy by 32.4% (33.8%) and improves performance by 6.3% (35.9%) compared with an all-fast (all-slow) write approach.
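As a hedged sketch of how a compiler might choose between the two write modes (the decision criteria below are assumptions for illustration, not CDDW's actual analysis):

```python
FAST, SLOW = "fast", "slow"   # volatile vs. non-volatile MLC write

def select_write_mode(cycles_until_rewrite, retention_cycles,
                      must_survive_power_off):
    """Pick a write mode for one store based on compiler analysis.
    cycles_until_rewrite: distance to the next store to this location
    (None if unknown, or if the location is never rewritten)."""
    if must_survive_power_off:
        return SLOW               # data needed across power cycles
    if (cycles_until_rewrite is not None
            and cycles_until_rewrite < retention_cycles):
        # The location is overwritten before the volatile fast-write
        # state would decay, so the cheap fast write is safe.
        return FAST
    return SLOW
```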
A whole program path (WPP) is a complete control flow trace of a program's execution. Recently, Larus [18] showed that although a WPP is expected to be very large (hundreds of megabytes), it can be greatly compressed (to tens of megabytes) and therefore saved for future analysis. While the compression algorithm proposed by Larus is highly effective, the compression is accompanied by a loss in the ease with which subsets of the information can be accessed. In particular, the path traces pertaining to a particular function generally cannot be obtained without examining the entire compressed WPP representation. To solve this problem, we advocate the application of compaction techniques aimed at providing easy access to path traces on a per-function basis. We present a WPP compaction algorithm in which the WPP is broken into path traces corresponding to individual function calls. All of the path traces for a given function are stored together as a block. The ability to construct the complete WPP from the individual path traces is preserved by maintaining a dynamic call graph. Compaction is achieved by eliminating the redundant path traces that result from different calls to a function and by replacing a sequence of static basic block ids that correspond to a dynamic basic block with a single id. We then transform the compacted WPP representation into a timestamped WPP (TWPP) representation in which the path traces are organized from the perspective of dynamic basic blocks; the TWPP representation offers additional opportunities for compaction. Experiments show that our algorithm compacts WPPs by factors ranging from 7 to 64. At the same time, the information is organized in a highly accessible form, which speeds up responses to queries requesting the path traces of a given function by over three orders of magnitude.
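A minimal sketch of the per-function compaction idea as described (simplified, with hypothetical identifiers; it omits the dynamic-basic-block id merging and the TWPP transformation):

```python
from collections import defaultdict

class CompactedWPP:
    """Group path traces by function, store each distinct trace once,
    and keep per-call records referencing the stored traces so the
    full WPP can still be reconstructed in order."""

    def __init__(self):
        self.traces = defaultdict(list)   # func -> unique path traces
        self.index = defaultdict(dict)    # func -> {trace: trace_id}
        self.calls = []                   # (caller, func, trace_id)

    def add_call(self, caller, func, path_trace):
        """Record one dynamic call; duplicate traces produced by
        different calls to the same function are stored only once."""
        key = tuple(path_trace)
        ids = self.index[func]
        if key not in ids:
            ids[key] = len(self.traces[func])
            self.traces[func].append(key)
        self.calls.append((caller, func, ids[key]))

    def traces_of(self, func):
        """Per-function query answered from one block of traces,
        without scanning the whole compacted WPP."""
        return self.traces[func]
```

Because each function's traces sit in their own block, a per-function query touches only that block, which matches the reported speedup of over three orders of magnitude for such queries.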