In this paper, a novel internal folded hardware-efficient architecture of multi-level 2-D 9/7 discrete wavelet transform (DWT) is proposed. For multi-level DWT, the unfolded structure is more extensively used compared with the folded structure, because of its low memory consumption and low time delay. However, a set of input data valid every few clock cycles caused the mismatch between clock and data in the unfolded structure. The mismatch usually needs to be solved by multi-clock or complex data adjustment, which increases the consumption of hardware resources and the complexity of the overall system. To solve the above problem of the unfolded structure, we adjust the data input timing by using a single clock domain and folding the DWT architecture of different levels in varying degrees, according to their own clock-to-data ratios. For an image of size of N × N pixels and 3-level DWT, the proposed architecture requires only 6N words temporal memory. For 3-level DWT with an image of size 512 × 512 pixels, the hardware estimation and comparison of the existing architectures show that, the hardware estimation result shows at least 30.6% area-delay-product (ADP) decrease, and at least 22.4% transistor-delay-product (TDP) decrease for S = 8, and 25.77% transistor-delay-product (TDP) decrease for S = 16.