“…Bisheh-Niasar et al [12] deployed 2×2 BU array and improved the access pattern to reduce the computational cycle (i.e., 324 CCs). Meanwhile, Bisheh-Niasar et al [13] employed two configurable BUs in parallel, which required a larger number of CCs (i.e., 474) and performed the NTT computation at low clock frequency. However, our fully pipelined NTT design has smallest CC number and outperforms that of [11], [12], and [13] approximately 1.7×, 7×, and 19.7× acceleration, respectively.…”