In this paper, we present scalable multi-precision multiplication implementation and scalable multi-precision squaring implementation for 32-bit ARM Cortex-M4 microcontrollers. For efficient computation and scalable functionality, we present optimized Multiplication and ACcumulation (MAC) techniques for the target microcontrollers. In particular, we present the 64-bit wise MAC operation with the Unsigned Long Multiply with Accumulate Accumulate (UMAAL) instruction. The MAC is used to perform column-wise multiplication/squaring (i.e., product-scanning) with general-purpose registers in an optimal way. Second, the squaring algorithm is further optimized through an efficient doubling routine together with an optimized product-scanning method. Finally, the proposed implementations achieved a very small memory footprint and high scalability to cover algorityms ranging from well-known public key cryptography (i.e., Rivest–Shamir–Adleman (RSA) and Elliptic Curve Cryptography (ECC)) to post-quantum cryptography (i.e., Supersingular Isogeny Key Encapsulation (SIKE)). All SIKE round 2 protocols were evaluated with the proposed modular reduction implementations. The results demonstrate that the scalable implementation can achieve the smallest code size together with a reasonable performance.