Lattice-based cryptography performs polynomial multiplication using the Number Theoretic Transform (NTT), which reduces the complexity of polynomial multiplication from O(n^2) to O(n log n). The NTT has been at the center of cryptographic research, as it is applied in many cryptographic schemes such as hash functions, homomorphic encryption, key-encapsulation mechanisms, and digital signatures. A common approach to the rapid production of hardware designs starts from semi-automatic generation from software, as supported by the Xilinx High-Level Synthesis (HLS) toolchain or similar tools. In most cases this approach requires careful code modifications (e.g. loop reordering, loop flattening, dependency removal, loop pipelining, loop unrolling) to achieve performance comparable to a hand-crafted Register-Transfer Level (RTL) design. In this paper, a design solution is proposed that resolves the data and loop-carried dependencies of the Cooley-Tukey NTT algorithm, helping the HLS synthesizer produce designs that are efficient in terms of latency and resources. The proposed work has been evaluated using the NTT of the Dilithium digital-signature scheme (n = 256, 23-bit modulus Q), and is shown to achieve a 20-50% improvement in latency, without significantly affecting resource usage, compared to other existing HLS-based NTT solutions in the literature.
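For orientation, the following is a minimal software sketch of an iterative Cooley-Tukey NTT and of the polynomial multiplication it enables, using the Dilithium parameters n = 256 and Q = 8380417. It is not the design proposed in this paper: the function names (`find_root`, `ntt`, `negacyclic_mul`, `schoolbook_negacyclic`) are hypothetical, the modular reduction is a plain `%`, and the negacyclic wrap-around is handled by scaling with powers of a 2n-th root of unity, which is an assumption made for brevity rather than the layout used by any particular implementation. The comment in the inner loop marks one example of a loop-carried dependency of the kind an HLS tool has to resolve before it can pipeline the butterfly loop.

```python
Q = 8380417  # Dilithium modulus, 2**23 - 2**13 + 1 (23 bits)
N = 256      # polynomial length

def find_root(order, q=Q):
    """Return an element of multiplicative order `order` (a power of two
    dividing q - 1) modulo q. Hypothetical helper for this sketch."""
    for x in range(2, q):
        w = pow(x, (q - 1) // order, q)
        if pow(w, order // 2, q) == q - 1:  # w**(order/2) == -1, so the order is exact
            return w
    raise ValueError("no root of the requested order")

def ntt(a, omega, q=Q):
    """Iterative Cooley-Tukey (decimation-in-time) cyclic NTT of `a`,
    where `omega` is a primitive len(a)-th root of unity modulo q."""
    n = len(a)
    a = list(a)
    # Bit-reversal permutation of the input.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) stages of n/2 butterflies each.
    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(omega, n // length, q)
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                # Loop-carried dependency: each twiddle depends on the previous
                # one. A precomputed twiddle table would remove it.
                w = w * w_len % q
        length *= 2
    return a

def negacyclic_mul(a, b, q=Q):
    """Multiply a and b modulo (x^n + 1, q) using O(n log n) operations."""
    n = len(a)
    psi = find_root(2 * n, q)   # psi**n == -1 (mod q)
    omega = psi * psi % q       # primitive n-th root of unity
    fa = ntt([x * pow(psi, i, q) % q for i, x in enumerate(a)], omega, q)
    fb = ntt([x * pow(psi, i, q) % q for i, x in enumerate(b)], omega, q)
    fc = [x * y % q for x, y in zip(fa, fb)]  # pointwise product
    c = ntt(fc, pow(omega, -1, q), q)          # inverse transform, still unscaled
    n_inv, psi_inv = pow(n, -1, q), pow(psi, -1, q)
    return [x * n_inv % q * pow(psi_inv, i, q) % q for i, x in enumerate(c)]

def schoolbook_negacyclic(a, b, q=Q):
    """O(n^2) reference multiplication modulo (x^n + 1, q)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q  # x^n wraps to -1
    return c
```

The schoolbook routine is included only as a reference for checking the NTT-based product. The modular inverses use the three-argument `pow` with a negative exponent, which requires Python 3.8 or later.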