In this paper, we propose a hardware nested-looping structure for the parameterized embedded DSP core NCU-DSP. The zero-overhead looping scheme adds no clock-cycle latency during loop execution. An optional instruction buffer holds the instructions of the loop body, which considerably reduces the power spent on program-memory accesses during instruction fetch. Both the instruction-buffer size and the maximum loop-nesting depth are design parameters of the NCU-DSP core. Design examples show that nested hardware looping costs only 3% in hardware overhead.
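The zero-overhead, buffered nested-loop mechanism described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the NCU-DSP implementation. The names `LOOP_DEPTH`, `LBUF_SIZE`, `Op`, `Instr`, and `Core` are hypothetical. So is the `LOOP count, end` instruction format, and so is the buffer policy: a direct-mapped buffer that is filled on the first pass and used only when the innermost active loop body fits in it. The model shows two effects. First, the loop-end check happens alongside normal execution, so no cycle is spent on a loop-closing branch. Second, repeated iterations are fetched from the buffer instead of program memory.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

LOOP_DEPTH = 4  # hypothetical maximum nesting depth (a design parameter)
LBUF_SIZE = 8   # hypothetical loop-buffer capacity in instructions (a design parameter)


class Op(Enum):
    NOP = auto()
    ACC = auto()   # stand-in for a useful DSP instruction
    LOOP = auto()  # hypothetical: repeat addresses pc+1..end `count` times
    HALT = auto()


@dataclass(frozen=True)
class Instr:
    op: Op
    count: int = 0  # LOOP only: iteration count (assumed >= 1)
    end: int = 0    # LOOP only: address of the last instruction in the body


@dataclass
class LoopEntry:
    start: int
    end: int
    remaining: int


@dataclass
class Core:
    loop_depth: int = LOOP_DEPTH
    lbuf_size: int = LBUF_SIZE
    stack: list = field(default_factory=list)    # hardware loop stack
    buf_tag: dict = field(default_factory=dict)  # buffer slot -> program address held
    pm_fetches: int = 0   # reads from program memory
    buf_fetches: int = 0  # reads served by the loop buffer
    cycles: int = 0
    acc: int = 0

    def fetch(self, prog, pc):
        top = self.stack[-1] if self.stack else None
        # Assumed policy: buffer only when the innermost active loop body fits.
        if top is not None and top.end - top.start + 1 <= self.lbuf_size:
            slot = pc % self.lbuf_size
            if self.buf_tag.get(slot) == pc:
                self.buf_fetches += 1  # hit: no program-memory access
                return prog[pc]
            self.buf_tag[slot] = pc  # first pass: fill the buffer slot
        self.pm_fetches += 1
        return prog[pc]

    def run(self, prog):
        pc = 0
        while True:
            instr = self.fetch(prog, pc)
            self.cycles += 1  # one cycle per issued instruction
            nxt = pc + 1
            if instr.op is Op.HALT:
                return self
            if instr.op is Op.ACC:
                self.acc += 1
            elif instr.op is Op.LOOP:
                if len(self.stack) == self.loop_depth:
                    raise RuntimeError("loop stack overflow")
                self.stack.append(LoopEntry(pc + 1, instr.end, instr.count))
            # Zero-overhead check: compare pc with the active loop's end address
            # in the same cycle, so no branch instruction is ever executed.
            while self.stack and pc == self.stack[-1].end:
                top = self.stack[-1]
                top.remaining -= 1
                if top.remaining > 0:
                    nxt = top.start
                    break
                self.stack.pop()  # loop finished; an outer loop may end here too
            pc = nxt


prog = [
    Instr(Op.LOOP, count=3, end=4),  # 0: outer loop over addresses 1..4
    Instr(Op.ACC),                   # 1
    Instr(Op.LOOP, count=4, end=4),  # 2: inner loop over addresses 3..4
    Instr(Op.ACC),                   # 3
    Instr(Op.ACC),                   # 4: last instruction of both loops
    Instr(Op.HALT),                  # 5
]
core = Core().run(prog)
print(core.cycles, core.acc, core.pm_fetches, core.buf_fetches)  # 32 27 6 26
```

In this toy run, every one of the 32 cycles issues a program instruction, and no cycle is spent closing the 3 outer or 12 inner iterations. Only 6 of the 32 fetches reach program memory. The other 26 come from the loop buffer. The two module-level constants play the role of the parameterized buffer size and nesting depth.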