Next generation Intel® micro-architecture (Nehalem) clocking architecture

Kurd, Nasser; Douglas, Jonathan; Mosalikanti, Praveen; Kumar, Rajesh

doi:10.1109/vlsic.2008.4585952

Cited by 38 publications

(13 citation statements)

References 3 publications

“…Assuming a 20% protocol overhead for ethernet, the PCIe v4 bus can be saturated by 9 teamed 40GbE connections. Second, representative of a more aggressively designed system that uses near-future technology, we consider a design that employs Quick Path Interconnect (QPI) [29] to connect CPUs to GPUs inside the server. Assuming 12 GPUs inside a 2-socket server, 6 point-to-point QPI links would be needed in each socket.…”
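The arithmetic behind the excerpt above can be reproduced with a short sketch. The function names are hypothetical and introduced only for illustration. The excerpt does not state the PCIe v4 capacity it compares against, so the code computes only the Ethernet payload rate and the per-socket link count, and it assumes the GPUs are split evenly across sockets with one point-to-point link per GPU.

```python
def teamed_payload_gbps(links, line_rate_gbps, overhead):
    """Payload bandwidth of `links` teamed connections after protocol overhead.

    `overhead` is the fraction of the line rate lost to protocol framing
    (0.20 for the 20% figure quoted in the excerpt).
    """
    return links * line_rate_gbps * (1.0 - overhead)


def links_per_socket(gpus, sockets):
    """Point-to-point links per socket, assuming an even split of GPUs
    across sockets and one link per GPU (an assumption of this sketch)."""
    if gpus % sockets != 0:
        raise ValueError("GPUs do not divide evenly across sockets")
    return gpus // sockets


# Figures taken from the excerpt: 9 teamed 40GbE links, 20% overhead.
payload_gbps = teamed_payload_gbps(links=9, line_rate_gbps=40, overhead=0.20)
payload_gbytes = payload_gbps / 8  # convert gigabits to gigabytes

# Figures taken from the excerpt: 12 GPUs in a 2-socket server.
qpi_links = links_per_socket(gpus=12, sockets=2)

print(f"payload: {payload_gbps:.0f} Gb/s ({payload_gbytes:.0f} GB/s)")
print(f"QPI links per socket: {qpi_links}")
```

Under these assumptions the nine teamed links carry about 288 Gb/s (36 GB/s) of payload, and the 12-GPU, 2-socket design needs 6 links per socket, matching the count given in the excerpt.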

Section: Addressing the Bandwidth Bottleneck (mentioning)

confidence: 99%

DjiNN and Tonic

Hauswald

Kang

Laurenzano

et al. 2015

Proceedings of the 42nd Annual International Symposium on Computer Architecture

124

As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, webservice companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, and natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high throughput DNN system based on massive GPU server designs and provide insights as to the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures. We investigate the total cost of ownership implications of having a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120× for all but one application (40× for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000× throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20×, depending on the composition of the workload.

“…Wide links tend to be source-synchronous; however, the delay between the clock and data paths can vary over time, making re-synchronization of the clock and data at the receiver necessary. As data rates increase, the mismatch between the data paths themselves has become large enough to require per-pin phase alignment [1]. Thus, a small, low-power CDR system is an important component of such interconnects.…”

Section: Introduction (mentioning)

confidence: 99%

All-digital CDR for high-density, high-speed I/O

Loh

Emami

2010

2010 Symposium on VLSI Circuits

A novel all-digital CDR for source-synchronous links, and its implementation in 90nm CMOS, is presented. A phase alignment technique with ping-pong action between two clock phases is used. The system is implemented in static CMOS logic, occupies 0.234 mm² and dissipates 16.6 mW at 6 Gb/s, demonstrating BER < 10⁻¹³ with PRBS-7 input. The compactness and all-static-CMOS nature of the system make it suitable for use in high-speed I/Os requiring per-pin synchronization. (Keywords: CDR, static CMOS, all-digital)

Introduction

Most modern high-speed interconnect relies on both high data rates per pin and parallelism. Wide links tend to be source-synchronous; however, the delay between the clock and data paths can vary over time, making re-synchronization of the clock and data at the receiver necessary. As data rates increase, the mismatch between the data paths themselves has become large enough to require per-pin phase alignment [1]. Thus, a small, low-power CDR system is an important component of such interconnects.

This paper presents a novel all-digital CDR system for source-synchronous links (Fig. 1). By taking a digital approach, this design avoids the increasing size, power and complexity overheads faced by analog techniques in highly scaled CMOS processes. Except for the front-end sense amplifiers (StrongARM latches), the system is implemented entirely using static CMOS logic gates, and the synchronization algorithm is synthesized from HDL into standard cells. Therefore, the design is highly portable and customizable, and its performance scales with the digital circuitry fed by the link. Finally, it collects data that can provide diagnostics for the link without extra hardware, useful for on-chip self-test and calibration.

Principle of Operation

The typical CDR uses a 2x oversampled data-clock/edge-clock technique and a PLL or a DLL. In this system, the edge clock is repurposed into a 'search-clock', not fixed at 90° relative to the data clock, but free to move within 2 unit intervals (UI). This 2 UI delay is generated by an 'open' delay line that is slowly and digitally calibrated. The samples produced by the search-clock are compared with those produced by the data-clock, generating match/mismatch (M/MM) data. As the search-clock sweeps through 2 UI, the M/MM information is collected into a 'signature', which can be thought of as a binary reduction of an eye diagram (Fig. 2). By filtering the raw signature, the system can identify the middle of the eye, where the search-clock will be positioned to recover the data at the end of the sweep. At this point the function of the search- and data-clocks is switched and a new sweep cycle, with the old data-clock now acting as the search-clock, starts. This ping-pong action overcomes the key limitation of traditional delay-line-based systems; allowing the data phase to swap from one UI to an adjacent one between updates enables the realization of an infinite delay range. The 2 UI delay can be calibrated by ensuring that the distance between the end of one 'eye opening' a...
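The signature-filtering step described in the abstract above can be illustrated with a minimal sketch. The abstract does not say which filter the authors use or how the eye center is chosen, so the sliding-window majority vote, the longest-run rule, and the names `filter_signature` and `eye_center` are all assumptions made for illustration. The sketch also ignores wrap-around across the 2 UI sweep and the ping-pong swap of the two clocks.

```python
def filter_signature(signature, window=3):
    """Smooth a raw match/mismatch signature (1 = match, 0 = mismatch).

    Uses a sliding-window majority vote, which is an assumed filter;
    the abstract only says the raw signature is filtered.
    """
    half = window // 2
    n = len(signature)
    filtered = []
    for i in range(n):
        segment = signature[max(0, i - half):min(n, i + half + 1)]
        # A position counts as a match if more than half of its
        # neighbourhood (truncated at the sweep edges) matched.
        filtered.append(1 if 2 * sum(segment) > len(segment) else 0)
    return filtered


def eye_center(signature):
    """Return the index at the middle of the longest run of matches,
    or None if the signature contains no match."""
    best_start, best_len = None, 0
    run_start = None
    # A trailing 0 sentinel closes a run that reaches the end of the sweep.
    for i, bit in enumerate(list(signature) + [0]):
        if bit and run_start is None:
            run_start = i
        elif not bit and run_start is not None:
            run_len = i - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical raw signature from one sweep of 16 delay steps over 2 UI.
raw = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
smoothed = filter_signature(raw)
print(smoothed)              # glitches at steps 5 and 13 are removed
print(eye_center(smoothed))  # delay step chosen for the search-clock
```

For this made-up signature the filter merges the two match regions into one run spanning steps 3 to 9, and the chosen sampling position is step 6.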

“…Today's microprocessors and system-on-chip (SOC) designs incorporate multiple phase-locked loops (PLLs) and delay-locked loops (DLLs) to satisfy clocking requirements for their various sub-systems and I/O interfaces [1]. To increase battery life of mobile products and also to enable small form factors/low cost thermal solutions, different system power states require individual blocks of the product to go into deep power down mode.…”

Section: Introduction (mentioning)

confidence: 99%

Fast lock scheme for phase-locked loops

Bashir

Ivatury

et al. 2009

2009 IEEE Custom Integrated Circuits Conference

This paper describes a fast lock scheme for phase-locked loops (PLLs). The proposed scheme utilizes mostly digital logic and control to achieve significant reduction in PLL lock acquisition time, which enables dynamic power cycling for various sub-systems on SOC designs. Multiple self-bias PLLs with fast lock schemes were designed to operate at VCO frequencies from 1.6 GHz to 5 GHz and fabricated in a 65nm CMOS process. Silicon measurements indicate up to 75% reduction in worst-case PLL lock times over the device operating conditions.
