Abstract-We present the details of our energy-efficient asynchronous floating-point multiplier (FPM). We discuss the design trade-offs of various multiplier implementations. A higher-radix array multiplier design with an operand-dependent carry-propagation adder and a low-handshake-overhead pipeline design is presented, which yields significant energy savings while preserving average throughput. Our FPM also includes hardware handling of the denormal and underflow cases. When compared against a custom synchronous FPM design, our asynchronous FPM consumes 3X less energy per operation while operating at 2.3X higher throughput. To our knowledge, this is the first detailed design of a high-performance asynchronous IEEE-754 compliant double-precision floating-point multiplier.

Keywords-Floating-point arithmetic; asynchronous logic circuits; very-large-scale integration; pipeline processing

I. INTRODUCTION

Energy-efficient floating-point computation is important for a wide range of applications. Traditionally, VLSI designers relied primarily on CMOS technology and voltage scaling to reduce power consumption [4]. With the transistor threshold voltage fixed [10], VDD has been scaling very slowly, if at all, which means that performance improvements now come at the cost of increased energy consumption. Furthermore, process variations in the deep sub-micron range have made devices far less robust, making it increasingly difficult for synchronous designers to overcome the problems associated with clock skew and clock distribution [6]. The findings of a recent in-depth study on scaling supercomputer petaFLOP performance by a further 1000X indicate that current design practices and technologies are inadequate to achieve the desired throughput within a sustainable power budget [1]. This underscores a pressing need for alternative design practices that reduce the energy consumption of floating-point computation while preserving robust behavior in advanced technology nodes.

At the other end of the spectrum, embedded systems that have traditionally been considered low performance are demanding ever higher throughput within the same power budget to support compute-intensive floating-point applications that improve the user experience. Since these applications are deployed on portable devices with limited battery life, it is critical that we develop energy-efficient floating-point hardware for these embedded systems, not simply high-performance floating-point hardware.

The IEEE 754 standard [19] for binary floating-point arithmetic provides a precise specification of floating-point number formats, computation operations, and exceptions and their handling. The combination of a vast range of inputs, special cases, and rounding modes makes the hardware implementation of fully IEEE 754 compliant floating-point arithmetic a very challenging task. Ignoring certain aspects of the standard can lead to unexpected consequences in the context of numerical algorithms. Hence, most floating-point hardware is IEEE-comp...
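To make the format and the special cases concrete, the following is a minimal software sketch (not from the paper, and purely illustrative) of how a double-precision operand decomposes into the IEEE-754 fields of 1 sign bit, 11 exponent bits, and 52 fraction bits, and how the zero, denormal, infinity, and NaN cases that a compliant FPM must handle are distinguished. The names fp64_fields, decompose, and classify are hypothetical helpers introduced here only for illustration.

```c
/* Illustrative sketch: IEEE-754 double-precision field extraction and
 * classification of the special cases a compliant FPM must handle. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef struct {
    uint32_t sign;      /* 1 bit                                   */
    uint32_t exponent;  /* 11 bits, biased by 1023                 */
    uint64_t fraction;  /* 52 bits, without the implicit leading 1 */
} fp64_fields;

static fp64_fields decompose(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret the bit pattern */
    fp64_fields f;
    f.sign     = (uint32_t)(bits >> 63);
    f.exponent = (uint32_t)((bits >> 52) & 0x7FF);
    f.fraction = bits & 0xFFFFFFFFFFFFFULL;  /* low 52 bits */
    return f;
}

static const char *classify(fp64_fields f) {
    if (f.exponent == 0)     return f.fraction ? "denormal" : "zero";
    if (f.exponent == 0x7FF) return f.fraction ? "NaN"      : "infinity";
    return "normal";
}

int main(void) {
    /* 5e-324 is the smallest positive denormal in double precision. */
    double samples[] = { 1.5, 0.0, 5e-324 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        fp64_fields f = decompose(samples[i]);
        printf("%-12g sign=%u exp=%4u frac=0x%013llx  %s\n",
               samples[i], f.sign, f.exponent,
               (unsigned long long)f.fraction, classify(f));
    }
    return 0;
}
```

In hardware, each of these classifications corresponds to a distinct datapath case; the paper's FPM handles the denormal and underflow cases directly in hardware rather than trapping to software.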