ABSTRACT. The required hardware overhead becomes even more severe when the DFE is designed in parallel. In this paper, we propose two new approaches to implementing the DFE when gigabit speed is required. The first approach is partial pre-computation, which can trade off hardware complexity against computation speed. The second approach is two-stage pre-computation, which can be applied to higher-speed applications. We can reduce the hardware overhead to about $2^{(\log_2 M)(-L/2)}$ [5].

Another approach to achieving high-speed computation is parallel implementation [6][7]; an N-parallel implementation relaxes the clock period by a factor of N. Exploiting both pipelining and parallel processing for high-speed applications is straightforward for non-recursive computations. However, recursive computations, such as DFEs, cannot easily be pipelined or processed in parallel because of the feedback loops in these filters. For a filter with a loop, the retiming approach [9] can be used to move the delay elements of a shorter path onto a longer path within the loop, so that a smaller critical path is obtained. However, retiming cannot shorten the iteration bound, and in most cases it cannot even achieve the iteration bound. In order to achieve the iteration bound, we can unroll a loop, which is referred to as the unfolding scheme [5][7], and then apply the retiming approach. However, for a gigabit throughput rate in each communication path, none of the aforementioned approaches can achieve the desired speed.

Fig. 1 shows a conventional DFE architecture. The critical path of this DFE consists of one multiplier, one slicer, and two adders, as drawn in bold lines. For 10GBASE-LX4 Ethernet [11], the modulation scheme is 2-PAM; therefore, the multiplier can be replaced by one 2-to-1 multiplexer, and the iteration bound reduces to one multiplexer and two adders. This critical path does not meet the required clock period of 0.32 ns. The authors of [1][2] reformulate the FBF as $2^L$-to-1 multiplexers, which reduces the iteration bound to one multiplexer. The delay of one multiplexer is 0.14 ns in a UMC 0.18 µm process technology. Even so, the computation speed of the architecture in [1][2] cannot leave the delay element enough timing margin. Although the unfolding approach can be employed to achieve the desired throughput rate, its overhead would be extremely large.

In this paper, both hardware cost and computation speed are taken into account. Motivated by [1][2], two new approaches are proposed. The first approach is to pre-compute and sum the partial outputs of the FBF; it can be used to trade off between hardware complexity and computation speed. For higher-speed applications, the second approach reformulates the FBF as a two-stage pre-computation. For M-PAM modulation and an L-tap FBF with wordlength W, we can reduce the hardware overhead to about $2^{(\log_2 M)(-L/2)}$ times that of [1][2], and the iteration bound is only $2(\log_2 W + 2)/L + \log_2 M$ multiplexer delays.

Fig. 1. The architecture of a DFE and its iteration bound.
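To make the pre-computation idea concrete, the following is a minimal behavioral sketch in Python, not the authors' hardware description. It models a 2-PAM DFE with an L-tap FBF in two equivalent ways: the direct recursion of Fig. 1, and a pre-computed form in the spirit of [1][2], in which all 2^L candidate decisions are formed feed-forward and the feedback loop only selects one of them with a 2^L-to-1 multiplexer. Pre-slicing every candidate is one common way to obtain a mux-only loop, and the sketch assumes that form; the coefficients and input samples are made-up test data.

```python
# Behavioral sketch (not the authors' RTL): a 2-PAM DFE with an L-tap
# feedback filter (FBF), modeled two ways:
#   (a) the conventional recursion of Fig. 1, and
#   (b) a pre-computed form as in [1][2]: every candidate decision is
#       formed outside the loop, and a 2^L-to-1 multiplexer addressed by
#       the L past decisions selects the matching pre-sliced decision.
# Coefficients 'b' and the FFF output 'x' below are made-up test data.

from itertools import product

def slicer(v):
    """2-PAM slicer: map a soft value to the decision +1 or -1."""
    return 1 if v >= 0 else -1

def dfe_conventional(x, b):
    """Direct recursion: y[n] = x[n] - sum_i b[i]*d[n-1-i]; d[n] = slicer(y[n])."""
    L = len(b)
    d_hist = [1] * L                      # past decisions d[n-1], ..., d[n-L]
    out = []
    for xn in x:
        yn = xn - sum(bi * di for bi, di in zip(b, d_hist))
        dn = slicer(yn)
        out.append(dn)
        d_hist = [dn] + d_hist[:-1]
    return out

def dfe_precomputed(x, b):
    """Pre-computation: all 2^L candidate decisions are formed feed-forward;
    the feedback loop only selects one of them (a 2^L-to-1 mux)."""
    L = len(b)
    patterns = list(product([1, -1], repeat=L))   # every possible decision history
    d_hist = (1,) * L
    out = []
    for xn in x:
        # Outside the loop: one pre-computed, pre-sliced decision per history.
        candidates = {p: slicer(xn - sum(bi * di for bi, di in zip(b, p)))
                      for p in patterns}
        dn = candidates[d_hist]                   # the mux inside the feedback loop
        out.append(dn)
        d_hist = (dn,) + d_hist[:-1]
    return out

if __name__ == "__main__":
    b = [0.25, -0.125, 0.0625]                    # example 3-tap FBF coefficients
    x = [0.9, -0.4, 0.3, -1.1, 0.2, 0.7, -0.6]    # example feed-forward filter output
    assert dfe_conventional(x, b) == dfe_precomputed(x, b)
    print(dfe_precomputed(x, b))
```

Because the candidate computation lies outside the recursion, it can be pipelined freely in hardware; only the selection remains in the loop, which is why the iteration bound of the reformulated FBF collapses to a single multiplexer delay as stated above.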
REVIEW OF T...