To move High-Performance Computing (HPC) closer to forward operating environments and missions, the Army Research Laboratory is developing approaches using hybrid, asymmetric core computing. By blending capabilities found in Graphics Processing Units (GPUs) and traditional von Neumann multicore Central Processing Units (CPUs), these approaches are being optimized to provide real-time or near-real-time processing speeds for research project applications. Algorithms are designed to partition work among the resources best suited to handle each part of the processing load. The use of commodity resources allows the design to remain flexible throughout the life cycle without the costly and time-consuming delays associated with Application-Specific Integrated Circuit (ASIC) development. This paradigm allows for rapid technology transfer to end users. In this paper, we describe a synchronous impulse reconstruction radar imaging algorithm that has been designed for hybrid CPU-GPU processing. We discuss various optimizations such as asynchronous task partitioning between the CPU and GPU as well as data movement reduction. We also discuss analysis and design of the algorithms within the context of two programming models: NVIDIA's CUDA and AMD's ATI Brook+. Finally, we report on the speedup achieved by this approach, which allowed us to take a code once restricted to postprocessing and transform it into one that exceeds real-time performance requirements.
A fully static 16K CMOS RAM using a high-density Si-gate bulk CMOS process and a circuit with a basic six-transistor CMOS RAM cell will be reported. The memory offers a 95 ns typical access time, 200 mW active power dissipation and standby power of less than 1 µW. Double polysilicon MOS technology has allowed static RAMs to reach 16K chips with polysilicon load devices. However, their standby current tends to be relatively high, since the load resistance value must be kept low to compensate for process variations. On the other hand, a six-transistor CMOS RAM cell has many advantages, especially in wide operational margin and low standby power. But the cell area tends to be large, since the cell consists of a cross-coupled flip-flop of four NMOS transistors with two PMOS load transistors. Therefore, an extremely tight layout rule must be used to achieve a 16K CMOS RAM with the desired device performance. High density of the CMOS memory has been achieved by selective scaling of device parameters and full use of dry etching processes¹. Typically, the gate oxide thickness and the effective channel length are 700 Å and 2.4 µm, respectively, for both PMOS and NMOS transistors. The minimum contact hole width is 2 µm, formed by reactive ion etching. The complete 16,384-bit CMOS RAM* (Figure 1) contains 1.03 × 10⁵ transistors. The memory cell is laid out in 1122 µm² (33 × 34 µm), and the die measures 5.06 × 5.77 mm (199 × 227 mils), which fits into a standard 24-pin plastic package. Table 1 summarizes typical characteristics. The device is organized as a 2048-word × 8-bit RAM. All I/O levels are TTL compatible and the RAM operates from a single 5 V supply. The device is pin compatible with standard 16K EPROMs*, providing additional board design flexibility. The memory block diagram is shown in Figure 2. The device has two chip enables, CE1 and CE2, and a write enable, WE. Data input/output buffers are controlled by these signals. CE2 selects active and standby modes.
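As a quick consistency check on the figures quoted above, the following sketch recomputes the bit count, cell area, and the share of the die taken by the cell array. The variable names are illustrative only, and the peripheral transistor count is a rough estimate, since the total of 1.03 × 10⁵ is itself a rounded figure.

```python
# Figures quoted in the text; all names below are illustrative.
WORDS, BITS_PER_WORD = 2048, 8
CELL_W_UM, CELL_H_UM = 33, 34          # cell dimensions in micrometres
DIE_W_MM, DIE_H_MM = 5.06, 5.77        # die dimensions in millimetres
TOTAL_TRANSISTORS = 1.03e5             # rounded total from the text
TRANSISTORS_PER_CELL = 6               # six-transistor CMOS cell

bits = WORDS * BITS_PER_WORD                       # total storage bits
cell_area_um2 = CELL_W_UM * CELL_H_UM              # area of one cell
array_area_mm2 = bits * cell_area_um2 / 1e6        # um^2 -> mm^2
die_area_mm2 = DIE_W_MM * DIE_H_MM
array_fraction = array_area_mm2 / die_area_mm2     # share of die used by cells
cell_transistors = bits * TRANSISTORS_PER_CELL
# Rough estimate only: the quoted total is rounded to three figures.
peripheral_transistors = TOTAL_TRANSISTORS - cell_transistors

print(f"bits: {bits}")
print(f"cell area: {cell_area_um2} um^2")
print(f"array area: {array_area_mm2:.2f} mm^2 of {die_area_mm2:.2f} mm^2 "
      f"({array_fraction:.0%} of die)")
print(f"cell transistors: {cell_transistors}, "
      f"peripheral (approx.): {peripheral_transistors:.0f}")
```

Under these numbers the cell array accounts for roughly 63% of the die, and the cells account for all but a few thousand of the transistors.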
In write cycles, the WE signal need not be a clock pulse, provided that either CE1 or CE2 is clocked. In other words, the write operation can be performed in at least three different modes. A fast access, typically 95 ns, was achieved by a predecoded address circuit and a high-speed sense amplifier. Figure 3 shows the predecoding circuit for a pair of row address signals. The circuit operates with the same number of address inputs as conventional one-step decoding, but with much higher speed resulting from its larger transistor conductance for the final decoding. The high-speed sense amplifier, shown in Figure 4, is connected to a pair of bit lines through a preamplifier. With feedback paths to the sources of the cross-coupled NMOS transistors in the NAND-type differential circuit, a sensitive and fast static sense amplifier was achieved. Figure 5. Typical access time, tACC, ranges from 95 ns at 5 V to 75 ns a...
*2716.
¹ Nozawa, H., Nishimura, S., Horiike, Y., Okumura, K., Iizuka, H. and Kohyama, S., "High Density CMOS Processing for a 16-Kbit RAM", IEEE IEDM Digest; Dec. 1979.
FIGURE 5-Oscillograph of address input and data output waveforms.
FIGURE 2-Block diagram of the 256Kb CMOS RAM.
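The two-step row decoding described in the text can be illustrated with a small logic-level sketch. This is not the circuit of Figure 3: the 6-bit row address width, the function names, and the MSB-first bit ordering are all assumptions made for illustration. Each pair of address bits is first predecoded into four one-hot lines, and each final gate then combines one line per pair, so it sees three inputs here rather than six while the number of address inputs stays the same.

```python
from itertools import product

def predecode_pair(a1, a0):
    """Predecode one pair of address bits into four one-hot lines.

    Line order is (0,0), (0,1), (1,0), (1,1), so the active line has
    index 2*a1 + a0.
    """
    return [int((a1, a0) == combo) for combo in product((0, 1), repeat=2)]

def decode_rows(address_bits):
    """Two-step decode: predecode each bit pair, then AND one line per pair.

    address_bits is MSB first and must have even length (assumption).
    Returns a one-hot list of word lines.
    """
    if len(address_bits) % 2 != 0:
        raise ValueError("expected an even number of address bits")
    pairs = [address_bits[i:i + 2] for i in range(0, len(address_bits), 2)]
    groups = [predecode_pair(a1, a0) for a1, a0 in pairs]
    word_lines = []
    # Each final gate takes exactly one predecoded line from every group.
    for selection in product(range(4), repeat=len(groups)):
        active = all(groups[g][line] for g, line in enumerate(selection))
        word_lines.append(int(active))
    return word_lines

# Hypothetical 6-bit row address (binary 101101), MSB first.
address = (1, 0, 1, 1, 0, 1)
word_lines = decode_rows(address)
selected_row = word_lines.index(1)
print(f"{len(word_lines)} word lines, selected row {selected_row}")
```

The selected row matches the plain binary value of the address, which shows that splitting the decode into two steps changes only the gate structure, not the addressing.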