Spoken keyword spotting has been widely used to enable always-on voice interfaces in consumer electronics, owing to its simplicity and low latency. Small-footprint keyword spotting based on tiny convolutional neural networks can run in real time on resource-constrained but energy-efficient microcontrollers. However, it is difficult for tiny neural networks to learn the noise robustness essential for reliable voice interfaces. To overcome this problem, this study proposes a joint framework of curriculum learning and knowledge distillation for noise-robust small-footprint keyword spotting. The proposed framework first applies noise mixture curriculum learning to a network that is sufficiently large to learn diverse noise conditions. Subsequently, knowledge distillation compresses the large network into one small enough to run onboard microcontrollers. To enhance the effectiveness of the joint framework, we propose curriculum learning with a new noise mixture strategy and knowledge distillation with an effective ensemble of neural network snapshots taken at each curriculum stage. These methods enable large networks to learn noisy conditions effectively and to transfer their noise robustness to small networks. The effectiveness of our joint framework is demonstrated on the Google Speech Commands dataset with noise mixtures drawn from various public noise datasets. Our joint framework achieves superior performance in noisy conditions compared with state-of-the-art noise-robust keyword spotting methods. The proposed framework therefore significantly improves the usability of voice interfaces in consumer electronics.
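The building blocks named above — mixing noise into clean speech at a controlled level, forming a teacher from an ensemble of network snapshots, and distilling it into a small student — can be sketched as follows. This is a minimal stdlib-only illustration under common assumptions (SNR-controlled additive mixing, logit-averaged snapshot ensemble, standard temperature-scaled distillation with a T² factor); the function names and exact loss form are illustrative, not the paper's implementation.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Additively mix a noise sample into a clean waveform, scaling the
    noise so the mixture reaches the target signal-to-noise ratio (dB).
    A curriculum can step this from high SNR (easy) to low SNR (hard)."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    target_noise_power = p_clean / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / p_noise)
    return [c + gain * n for c, n in zip(clean, noise)]

def ensemble_teacher_logits(snapshot_logits):
    """Average the logits of network snapshots (e.g., one saved per
    curriculum stage) to form a single ensemble teacher output."""
    n = len(snapshot_logits)
    return [sum(col) / n for col in zip(*snapshot_logits)]

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In this sketch, a training step for the small student would mix each utterance at the current curriculum stage's SNR, query the snapshot-ensemble teacher for soft targets, and minimize `distillation_loss` (typically combined with the ordinary cross-entropy on hard labels).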