We propose a novel discrete wavelet transform (DWT) architecture which is fully scalable, flexible, and modular. This architecture is bit serial, and therefore, has low hardware complexity and low power requirement. Nevertheless, because of its particular structure, it operates on-the-fly (i.e., it does not require wait cycles between consecutive input samples). Moreover, a very small hardware overhead can upgrade the architecture to compute also the inverse DWT ("double-face" utilization). Hardware complexity and computing performance are analyzed in detail.