Advances in very large scale integration (VLSI) technology and the inherently parallel structure of field programmable gate arrays (FPGAs) have made real-time implementation of finite impulse response (FIR) filters increasingly relevant. Because the computational complexity of an FIR filter grows with its length, numerous techniques have been developed to design viable architectures for realizing FIR filters. The multiple input multiple output (MIMO) based parallel FIR filter architecture often requires a large area as the level of parallelism and the filter order increase. This study presents a novel parallel architecture for FIR filter design that eliminates the dependence of resource usage on the level of parallelism. The proposed architecture does so by limiting the number of multipliers required, irrespective of the level of parallelism, and it further modifies the data flow to improve the iteration period. The level of parallelism (L) is fixed at 8 in this study, a value that is neither too low nor too high; note that increasing the level of parallelism increases the number of output samples generated at a given time. The work is demonstrated using a 16-tap FIR filter implemented on the FPGA Virtex-4 xc4vsx35-10ff668 and Virtex-5 XC5VSX95T-1FF1136 platforms, and an order-16 FIR filter implemented on the Artix-7 xc7a200t-2fbg676. Functionality is verified using the Xilinx ISE 14.7 and MATLAB 2018 environments. Post-synthesis results and a comparative study validate the design: the proposed architecture outperforms the benchmark and conventional MIMO-based parallel FIR filter architectures, improving total delay by 70% with respect to [2], 93% with respect to [3], and 2% with respect to [4] in case studies 1-3.
The area reduction appears as a decrease in the slice requirement of 31%, 47%, and 110% and a decrease in the LUT requirement of 58%, 66%, and 65%; further, the delay metric improves by 49%, 80%, and 67% when compared with the MIMO-based parallel architecture of [1] on the same platforms used in [2], [3], and [4], respectively.
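To make the notion of level of parallelism concrete, the following is a minimal behavioural sketch of generic block (L-parallel) FIR filtering with L = 8 and a 16-tap filter. It is not the proposed architecture and does not model multiplier sharing, data flow, or any hardware resource; it only illustrates that a block formulation emits L output samples per iteration while matching the serial filter output. The coefficient values, the input sequence, and the function names `fir_serial` and `fir_block` are assumptions chosen for illustration.

```python
L = 8  # level of parallelism, as fixed in the study

def fir_serial(x, h):
    """Direct-form FIR: one output sample per iteration."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]

def fir_block(x, h, block_size=L):
    """Block FIR: each iteration consumes up to block_size inputs
    and emits the same number of outputs."""
    y = []
    iterations = 0
    for start in range(0, len(x), block_size):
        stop = min(start + block_size, len(x))
        # In hardware these outputs would be computed concurrently;
        # here they are computed in a loop purely to model the behaviour.
        block = [
            sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(start, stop)
        ]
        y.extend(block)
        iterations += 1
    return y, iterations

# Hypothetical integer data so the comparison is exact.
h = list(range(1, 17))   # 16 taps (illustrative values only)
x = list(range(1, 33))   # 32 input samples (illustrative values only)

y_serial = fir_serial(x, h)
y_block, iterations = fir_block(x, h)
```

In this sketch the serial filter needs one iteration per sample, whereas the block version needs `len(x) / L` iterations for the same output, which is the sense in which a higher level of parallelism produces more samples at a given time.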