IBM zEnterprise redundant array of independent memory subsystem

Meaney, P. J.; Lastras-Montaño, L.A.; Papazova, V. K.; Stephens, E.; Johnson, Jon; Alves, L. C.; O'Connor, J. A.; Clarke, William

doi:10.1147/jrd.2011.2177106

Cited by 33 publications

(31 citation statements)

References 10 publications


“…Chipkill improves reliability by interleaving error detection and correction data among multiple DRAM chips [10]. RAIM [11] [15,22,38]) have shown that the OS retiring memory pages after a certain number of errors can eliminate up to 96.8% of detected memory errors. These techniques, though they improve system reliability, still require costly ECC hardware for detecting and identifying memory pages with errors.…”

Section: B Related Work (mentioning)

confidence: 99%

“…In terms of performance, existing error detection and correction techniques incur a slowdown on each memory access due to their additional circuitry [15,16] … an additional 10% slowdown due to techniques that operate DRAM at a slower speed to reduce the chances of random bit flips due to electrical interference in higher-density devices that pack more and more cells per square nanometer [17]. In addition, whenever an error is detected or corrected on modern hardware, the processor raises an interrupt that must be serviced by the system firmware (BIOS), incurring up to 100 µs latency (roughly 2000× a typical 50 ns memory access latency [18]), leading to unpredictable slowdowns.…”

[Table rows interleaved with the excerpt; column headers and the first row's label were not captured]
[10]: 2/8 chips (1/8 chips), 12.5%, High
RAIM [11]: 1/5 modules (1/5 modules), 40.6%, High
Mirroring [12]: 2/8 chips (1/2 modules), 125%, Low

Section: Introduction (mentioning)

confidence: 99%
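The percentages and the latency ratio quoted in the Introduction excerpt can be checked with a little arithmetic. The sketch below is only a plausibility check, not the citing authors' method. It assumes (the excerpt does not say so) that each module stores 72 bits per 64-bit data word (64 data bits plus 8 ECC check bits), that RAIM uses five such modules where four would hold the data, and that mirroring duplicates one ECC-protected module. The function name `added_capacity_pct` is hypothetical.

```python
# Plausibility check for the figures quoted in the excerpt.
# ASSUMPTIONS (not stated in the excerpt): 72 stored bits per 64 data bits on
# each module; RAIM = 5 modules where 4 would hold the data; mirroring = two
# copies of an ECC-protected module.


def added_capacity_pct(total_bits: int, data_bits: int) -> float:
    """Extra storage, as a percentage of the unprotected data capacity."""
    return (total_bits / data_bits - 1.0) * 100.0


# (72, 64) ECC on a single module.
ecc_pct = added_capacity_pct(72, 64)

# Five 72-bit modules carrying what four 64-bit modules would hold.
raim_pct = added_capacity_pct(5 * 72, 4 * 64)

# Two 72-bit copies of every 64-bit word.
mirroring_pct = added_capacity_pct(2 * 72, 64)

# Firmware interrupt latency versus a typical memory access, both in ns.
interrupt_ns = 100_000  # 100 µs, as quoted
access_ns = 50          # 50 ns, as quoted
latency_ratio = interrupt_ns / access_ns

print(f"ECC: {ecc_pct}%  RAIM: {raim_pct}%  mirroring: {mirroring_pct}%")
print(f"interrupt / access latency: {latency_ratio:.0f}x")
```

Under these assumptions the results line up with the quoted 12.5%, 40.6% (40.625% before rounding), 125%, and 2000×, which suggests the percentage column reports added capacity relative to unprotected memory.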


Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Luo, Govindan, Sharma, et al. 2014

2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks



Abstract: Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approaches to providing reliability for memory devices pessimistically treat all data as equally vulnerable to memory errors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. For example, we found that while traditional error protection increases memory system cost by 12.5%, some applications can achieve 99.00% availability on a single server with a large number of memory errors without any error protection. This presents an opportunity to greatly reduce server hardware cost by provisioning the right amount of memory reliability for different applications. Toward this end, in this paper, we make three main contributions to enable highly-reliable servers at low datacenter cost. First, we develop a new methodology to quantify the tolerance of applications to memory errors. Second, using our methodology, we perform a case study of three new data-intensive workloads (an interactive web search application, an in-memory key-value store, and a graph mining framework) to identify new insights into the nature of application memory error vulnerability. Third, based on our insights, we propose several new hardware/software heterogeneous-reliability memory system designs to lower datacenter cost while achieving high reliability and discuss their trade-offs. We show that our new techniques can reduce server hardware cost by 4.7% while achieving 99.90% single server availability.


“…inline memory modules (DIMMs) for 3 RAIM [11] protected memory ports, 2 GX++ I/O links, and 5 PCIe x16 Gen3 I/O links. Up to four processor drawers are plugged into a frame, interconnected by passive electric cables that form the off-drawer ABus network.…”

Section: System Topology (mentioning)

confidence: 99%

The IBM z13 processor cache subsystem

et al. 2015

Self Cite


“…The primary developments have been Chipkill [20], SDDC [21], Chipspare [22], and a redundant array of independent memory (RAIM) [23]. The first three implementations are adequately similar to be discussed as one advanced ECC method.…”

Section: Overview Of Error Correction (mentioning)

confidence: 99%

“…This interleaving can be performed very quickly in hardware along with the (72, 64) ECC logic, and thus adds negligible latency as compared to a standard ECC protocol. The overhead of an RAIM can also be negligible, due to the use of the advanced ECC techniques detailed above; however, the correction of some hardware errors may incur many microseconds of overhead [23].…”

Section: Overview Of Error Correction (mentioning)

confidence: 99%

Resilient Optically Connected Memory Systems Using Dynamic Bit-Steering [Invited]

Brunina, Lai, Liu, et al. 2012

J. Opt. Commun. Netw.


Abstract: Resilience is becoming an increasingly critical performance requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices are primarily designed for density, optimizing memory capacity and throughput, rather than resilience. A strict focus on memory performance instead of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. As compared to an electronically connected approach, we demonstrate significantly higher memory bandwidths and reduced latencies, in addition to a 700× improvement in resilience.
