In deep-submicron technology, wire delay is no longer negligible and is gradually becoming a dominant factor in system performance. Several state-of-the-art architectural synthesis flows have already adopted the distributed register architecture to cope with increasing wire delay by allowing multicycle communication. In this paper, we formulate channel and register allocation within a refined regular distributed register architecture, named RDR-GRS, as a problem of simultaneous data transfer routing and scheduling that minimizes global interconnect resources. We also present an innovative algorithm with both spatial and temporal considerations. It features a concentration-oriented path router, which gathers wire-sharable data transfers in the spatial domain, and a channel-based time scheduler, which resolves contention for the wires in a channel in the temporal domain. The experimental results show that the proposed algorithm can significantly outperform existing related work.
I Introduction

As technology advances into the deep-submicron (DSM) era, interconnect delay is becoming unavoidable because of resistance-capacitance delay, coupling effects, inductance, multiple-gigahertz operating frequencies, and so on [1]. In architectural synthesis, the maximum combined delay of a functional unit (FU) and its associated wires determines the system speed. If the synthesis flow still neglects the delays introduced by long wires (especially global interconnects), the impact of those long wires after physical floorplanning is very likely to degrade overall system performance through an unexpectedly longer clock cycle time. To solve this problem, [2][3][4] propose synthesis flows that apply preliminary floorplanning to estimate long interconnect delays more accurately and thereby obtain better synthesis results.

Typically, a centralized register (CR) architecture is presumed in high-level synthesis. In a CR-based architecture, an FU is expected to access any register within one clock cycle. Although device speed generally increases as the manufacturing process advances, wire delay does not scale as well as feature size. Consequently, global wire delay gradually dominates and significantly lengthens the cycle time. Hence, [5][6][7][8][9][10] propose distributed register (DR) architectures to overcome this issue. In a DR-based architecture, the whole system is partitioned into several clusters, and each cluster contains its own local FUs and registers. As a result, the inter-cluster interconnect delay can be isolated from the intra-cluster delay. The latter includes the local wire delay within the same cluster and is assumed to be shorter than a single cycle, while the former is the global data transfer delay between different clusters and is allowed to take multiple cycles.
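The contrast between the two register architectures can be illustrated with a small numerical sketch. It is only a toy model under stated assumptions: all delay values are hypothetical and do not come from the paper, the clock period is taken to be the largest FU-plus-wire delay among the paths that must complete in one cycle, and the helper names `cycle_time` and `transfer_cycles` are introduced here purely for illustration.

```python
import math

# Hypothetical delays in picoseconds; none of these numbers come from the paper.
# Each pair is (FU delay, delay of the wire to the register it accesses).
local_paths = [(2000, 300), (1800, 500)]   # FU and register in the same cluster
global_paths = [(2000, 4000)]              # FU and register in different clusters


def cycle_time(paths):
    """Clock period = largest (FU delay + wire delay) over the given paths."""
    return max(fu + wire for fu, wire in paths)


def transfer_cycles(wire_delay, period):
    """Number of cycles a multicycle inter-cluster transfer occupies.

    Simplifying assumption: only the wire delay is counted for the transfer.
    """
    return math.ceil(wire_delay / period)


# CR assumption: every access, local or global, must fit in a single cycle.
cr_cycle = cycle_time(local_paths + global_paths)

# DR assumption: only intra-cluster paths constrain the clock period;
# inter-cluster transfers are allowed to span several cycles.
dr_cycle = cycle_time(local_paths)
global_cycles = [transfer_cycles(wire, dr_cycle) for _, wire in global_paths]

print("CR cycle time (ps):", cr_cycle)
print("DR cycle time (ps):", dr_cycle)
print("Cycles per global transfer under DR:", global_cycles)
```

In this toy setting, the single long wire dictates the clock period under the CR assumption, whereas under the DR assumption the period depends only on the local paths and the long wire is paid for in extra transfer cycles instead.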
Accordingly, the DR architecture not only alleviates the increase in cycle time caused by long wire delay but also enables simultaneous computation and communication.

Though allowing multicycle global data transfer can reduce the impact on system spe...