As the complexity of high-performance microprocessors increases, functional verification becomes more and more difficult, and RTL simulation emerges as the bottleneck of the design cycle. In this paper, we suggest a C-language-based design and verification methodology to enhance simulation speed, instead of the conventional HDL-based methodologies. The RTL C model, StreC, describes the cycle-based behavior of synchronous circuits and is followed by model refining and optimization using the LifeTime Analyzer (LTA) and Cleaner. The simulation speed of the cycle-based C model makes it possible to test the RTL design with "real-world" application programs at an order-of-magnitude faster speed than commercial event-driven simulators. Using the proposed functional verification methodology, HK486, an Intel 80486-compatible microprocessor, was successfully designed and verified.
Flexible architectures are critical for energy-efficient accelerators in data centers that leverage advances in the performance and energy efficiency of recent algorithms, since they must support state-of-the-art algorithms for various deep learning tasks. Because matrix-multiplication units lie at the core of tensor operations, most recent programmable architectures lack flexibility for layers with diminished dimensions, especially for inference, where a large batch axis is rarely allowed. In addition, exploiting the data reuse inherent in tensor operations when computing a single matrix multiplication is challenging. In this work, an extension of a vector processor in 14 nm is proposed, customized for tensor operations. The flexible architecture enables a tensorized loop that supports various data layouts and different shapes and sizes of tensor operations. It also exploits all possible data reuse, including input, weight, and output. Based on the tensorized loop, the fetch and reduction networks, which unicast or multicast with the ordering of both input data and processing data, can be simplified into a circuit-switching-like network with a configured topology and flow control for each tensor operation. Two processing elements can be fused to optimize latency for a large model, or can operate individually for throughput. As a result, various state-of-the-art models can be processed efficiently with straightforward compiler optimization, and the highest energy efficiency of 13.4 Inferences/s/W on EfficientNetV2-S is demonstrated.