The lightweight block cipher PIPO-64/128 was presented at ICISC 2020. PIPO, which operates on 8-bit units and uses an unbalanced-bridge S-box, outperformed other lightweight block ciphers in an 8-bit AVR environment. Optimized implementations of PIPO have since been proposed for various environments; however, no optimization research has been conducted for two popular 32-bit processors: ARM Cortex-M4 and RISC-V. Since the RISC-V and ARM Cortex-M series platforms do not support bit-based Single Instruction Multiple Data (SIMD) instructions, several aspects must be considered when applying a forced parallelization strategy. In this article, we discuss the implementation methodology of PIPO for 32-bit RISC-V and ARM Cortex-M4 environments. We optimize the performance of the S-Layer via the proposed register-scheduling and masking techniques while maintaining parallelism in the R-Layer implementation. Moreover, we propose an on-the-fly key scheduling technique for further performance improvement. Finally, when four plaintexts are encrypted simultaneously, our software achieves a performance of 229% and 370% compared to the existing reference implementations on the RISC-V and ARM Cortex-M4 platforms, respectively.
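To make the idea of forced parallelization without SIMD support concrete, the following C sketch rotates four independent 8-bit values packed into one 32-bit word, using masks to stop bits from leaking between neighbouring bytes. It is only an illustration under stated assumptions, not the implementation described in this article: the helper name `rotl8x4` and the lane layout (one byte per plaintext block in each 32-bit register) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper (not from the article): rotate each of the four
 * 8-bit lanes packed in x left by r bits (r is reduced modulo 8).
 * Assumed layout: each byte of x belongs to a different plaintext block. */
static uint32_t rotl8x4(uint32_t x, unsigned r)
{
    r &= 7u;
    /* Per-lane mask for the bits that remain in the lane after x << r. */
    uint32_t keep_hi = 0x01010101u * ((0xFFu << r) & 0xFFu);
    /* Per-lane mask for the bits that wrap around to the low end. */
    uint32_t keep_lo = 0x01010101u * (0xFFu >> (8u - r));
    /* The masks discard bits shifted in from the neighbouring lane. */
    return ((x << r) & keep_hi) | ((x >> (8u - r)) & keep_lo);
}
```

For example, `rotl8x4(0x12345678u, 4)` swaps the two nibbles of every byte and yields `0x21436587`. A single call costs two shifts, two ANDs and one OR for four lanes, which is the kind of trade-off a masked, register-packed approach has to weigh against processing one block at a time.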