“…A series of studies has established convergence (Jacot et al., 2018; Li and Liang, 2018; Du et al., 2019; Allen-Zhu et al., 2019b; Zou et al., 2018) and generalization (Allen-Zhu et al., 2019a; Arora et al., 2019a,b; Cao and Gu, 2019) guarantees in the so-called "neural tangent kernel" (NTK) regime, where the parameters stay close to their initialization and the neural network function is approximately linear in its parameters. A recent line of work (Allen-Zhu and Li, 2019; Bai and Lee, 2019; Allen-Zhu and Li, 2020a,b,c; Li et al., 2020; Cao et al., 2022; Zou et al., 2021; Wen and Li, 2021) studies the learning dynamics of neural networks beyond the NTK regime. It is worth mentioning that our analysis of the MoE model is also beyond the NTK regime.…”
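The approximate linearity underlying the NTK regime can be sketched as a first-order Taylor expansion of the network output around its initialization (the notation below is illustrative, not taken from the source):

```latex
% Network f(x;\theta) linearized around the initialization \theta_0:
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0),
% which is linear in \theta and induces the (empirical) neural tangent kernel
K(x, x') \;=\; \nabla_\theta f(x;\theta_0)^\top \, \nabla_\theta f(x';\theta_0).
```

When the parameters stay close to \(\theta_0\) throughout training, the gradients (and hence the kernel \(K\)) remain nearly constant, so training reduces to kernel regression with \(K\); analyses "beyond the NTK regime" drop this near-constancy assumption.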