In this paper, we propose two scalable architectures (say, $\mathrm{Arc}_1$ and $\mathrm{Arc}_2$) that perform the discrete wavelet transform (DWT) of an $N_0$-sample sequence in only $N_0/2$ clock cycles. They are therefore at least twice as fast as the other known architectures, and their $AT^2$ parameter is approximately 1/2 that of existing devices. This result has been achieved by means of carefully balanced pipelining, and it has two "faces." First, $\mathrm{Arc}_1$ and $\mathrm{Arc}_2$ can process data twice as fast as other architectures working at the same clock frequency (high-speed utilization). Second, they can run at half the clock frequency while matching the performance of other architectures. This second possibility allows the supply voltage and the power dissipation to be reduced by factors of two and four, respectively, with respect to other architectures (low-power utilization). As a final result, we show that a parallel architecture implementing an $L$-tap filter-based DWT with $J$ decomposition levels [say, $\mathrm{Arc}_{\mathrm{OPT}}(J, L)$] can be defined so as to achieve an excellent efficiency (say, $\mathrm{eff}[\mathrm{Arc}_{\mathrm{OPT}}(J, L)]$) for any value of $J$ and $L$. For instance, the average value of $\mathrm{eff}[\mathrm{Arc}_{\mathrm{OPT}}(J, L)]$ [computed over a very wide set $S$ of "points" $(J, L)$] is 99.1%. The minimum value of $\mathrm{eff}[\mathrm{Arc}_{\mathrm{OPT}}(J, L)]$ in $S$ is 93.8%, and, at all but five "points," $\mathrm{eff}[\mathrm{Arc}_{\mathrm{OPT}}(J, L)]$ is not lower than 96.9%.
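To make the workload concrete, the following is a minimal software sketch of a multilevel filter-bank DWT. It is not a model of the proposed architectures or their pipelining. The function names, the periodic boundary extension, the Haar filter pair, and the test sequence are all assumptions chosen for illustration, and the input length `n0` is assumed to be divisible by two raised to the number of levels. The sketch shows that each level halves the sequence, so the total number of low-pass/high-pass output pairs over all levels stays below `n0`.

```python
import math


def dwt_level(x, lo, hi):
    """One decomposition level: filter with periodic extension, keep every second output."""
    n = len(x)
    approx, detail = [], []
    for k in range(0, n, 2):  # decimation by two
        approx.append(sum(lo[i] * x[(k + i) % n] for i in range(len(lo))))
        detail.append(sum(hi[i] * x[(k + i) % n] for i in range(len(hi))))
    return approx, detail


def dwt(x, lo, hi, levels):
    """Return the final approximation, the detail list per level, and output-pair counts."""
    approx = list(x)
    details, counts = [], []
    for _ in range(levels):
        approx, d = dwt_level(approx, lo, hi)
        details.append(d)
        counts.append(len(d))  # one (approx, detail) pair per decimated position
    return approx, details, counts


# Haar pair (2 taps), used only as an illustrative filter choice (assumption).
S2 = 1 / math.sqrt(2)
HAAR_LO = [S2, S2]
HAAR_HI = [S2, -S2]

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # n0 = 8 (hypothetical data)
approx, details, counts = dwt(x, HAAR_LO, HAAR_HI, levels=3)
print(counts)       # output pairs produced at each level
print(sum(counts))  # total output pairs over all levels, always below n0
```

Because the per-level counts form the series n0/2, n0/4, and so on, the first level alone accounts for at least half of all output pairs, which gives a sense of why the first-level workload sets the scale for any cycle-count bound.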