Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by the on-die caches of existing server chips. Large caches reduce the die area available for cores and degrade performance through the long access latencies incurred when fetching instructions. Performance on scale-out workloads is maximized through a modestly sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors with optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., inter-pod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scale-out chips improve throughput by 5x-6.5x over conventional organizations and by 1.6x-1.9x over emerging tiled organizations.
Flexible electronics can create lightweight, conformable components that could be integrated into smart systems for applications in healthcare, wearable devices and the Internet of Things. Such integrated smart systems will require a flexible processing engine to address their computational needs. However, the flexible processors demonstrated so far are typically fabricated using low-temperature polysilicon thin-film transistor (TFT) technology, which has a high manufacturing cost, and the processors that have been created with low-cost metal-oxide TFT technology have limited computational capabilities. Here, we report a processing engine that is fabricated with a commercial 0.8 μm metal-oxide TFT technology. We develop a resource-efficient machine learning (ML) algorithm (termed univariate Bayes feature voting classifier) and demonstrate its implementation with hardwired parameters as a flexible processing engine for an odour recognition application. Our flexible processing engine contains around 1,000 logic gates and has a gate density per area that is 20-45 times higher than other digital integrated circuits built with metal-oxide TFTs.

Flexible electronic devices are built on substrates such as paper, plastic and metal foil, and use active materials such as organics, metal oxides and amorphous silicon. They offer a number of advantages over traditional silicon devices, including thinness, conformability and low manufacturing costs, and various commercial systems are already available, including organic light-emitting diodes, flexible displays and organic photovoltaics.
The integration of different flexible components (for instance, printed sensors, organic displays, printed batteries, energy harvesters, memories, antennas, and near-field communication or radio-frequency identification (RFID) chips) could lead to innovative products such as flexible integrated smart systems [1] for logistics, fast-moving consumer goods (FMCG), healthcare, wearables, and the Internet of Things.

Standard ML practice is followed: the dataset is split into training and test datasets. The ML algorithms are then trained offline using the training datasets. Once training is complete, the performance of the ML algorithms with learned parameters is evaluated on the test datasets. We use a 5-fold cross-validation methodology to avoid overfitting. Classification prediction accuracy, defined as the fraction of predictions that match the ground truth, is used as the metric. No visible difference is observed between 5-bit and full-precision data representations. The best-performing ML algorithm is GNB, with a prediction accuracy of 92%. (b) The 5-bit GNB design variants are compared in terms of gate count and execution time. The three GNB variants are created by either sharing or duplicating the multiply-accumulate (MAC)
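The evaluation flow described above (train/test split, Gaussian naive Bayes, 5-fold cross-validation, accuracy against ground truth) can be sketched in software. This is a minimal illustrative sketch, not the paper's hardwired univariate Bayes feature voting classifier: it uses full-precision floating point rather than the 5-bit representation, and all function names and the synthetic data in the usage example are hypothetical.

```python
import numpy as np

def gnb_fit(X, y):
    """Estimate per-class Gaussian parameters (mean, variance, prior)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Small epsilon keeps the variance strictly positive.
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(y))
    return params

def gnb_predict(params, X):
    """Pick the class with the highest Gaussian log-likelihood plus log-prior."""
    classes = list(params.keys())
    scores = []
    for c in classes:
        mu, var, prior = params[c]
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        scores.append(ll + np.log(prior))
    return np.array(classes)[np.argmax(np.vstack(scores), axis=0)]

def kfold_accuracy(X, y, k=5, seed=0):
    """Mean classification accuracy over k cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        params = gnb_fit(X[train], y[train])
        accs.append(np.mean(gnb_predict(params, X[test]) == y[test]))
    return float(np.mean(accs))
```

A usage example with synthetic two-class data (in place of the paper's odour-sensor dataset) would be `kfold_accuracy(X, y, k=5)`, returning the mean accuracy across the five folds.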