Recently, the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. showed that the negative term in the contrastive loss can be removed if a so-called prediction head is added to the network architecture, which breaks the symmetry between the positive pairs. This discovery initiated the study of non-contrastive self-supervised learning. It remains mysterious why, even when trivial collapsed global optima exist, neural networks trained by (stochastic) gradient descent can still learn competitive representations and avoid the collapsed solutions. This phenomenon is one of the most typical examples of implicit bias in deep learning optimization, and its underlying mechanism remains little understood to this day.

In this work, we present our empirical and theoretical discoveries about the mechanism of the prediction head in non-contrastive self-supervised learning methods. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trained, the network can learn competitive representations even though the trivial optima still exist in the training objective. Moreover, we observe a consistent rise-and-fall trajectory of the off-diagonal entries during training. Our evidence suggests that understanding the identity-initialized prediction head is a good starting point for understanding the mechanism of the trainable prediction head.

Theoretically, we present a framework for understanding the behavior of the trainable but identity-initialized prediction head. Under a simple setting, we characterize a substitution effect and an acceleration effect of the prediction head during training. The substitution effect occurs when learning the stronger features in some neurons substitutes, through updates to the prediction head, for learning those features in other neurons. The acceleration effect occurs when the substituted features accelerate the learning of the weaker features, preventing them from being ignored. Together, these two effects enable the neural network to learn all the features rather than focusing only on the stronger ones; such a focus is a likely cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
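As a concrete illustration of the empirical setup described above, the following is a minimal PyTorch sketch of a prediction head initialized as an identity matrix whose diagonal is frozen, so that only the off-diagonal entries receive gradient updates. The class name, the gradient-mask mechanism, and the dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class IdentityInitPredictionHead(nn.Module):
    """Linear prediction head initialized to the identity matrix,
    with only the off-diagonal entries receiving gradient updates.

    Illustrative sketch: the masking mechanism and names are
    our own choices, not the paper's exact implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Start exactly at the identity, so the head is a no-op at init.
        self.weight = nn.Parameter(torch.eye(dim))
        # Gradient mask that zeroes the diagonal, freezing it at 1.
        self.register_buffer("offdiag_mask", 1.0 - torch.eye(dim))
        self.weight.register_hook(lambda grad: grad * self.offdiag_mask)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z @ self.weight.T


# At initialization the head is the identity map; during training the
# off-diagonal entries are the only trainable parameters, so their total
# magnitude can be logged to observe the rise-and-fall trajectory.
head = IdentityInitPredictionHead(dim=8)
z = torch.randn(4, 8)
assert torch.allclose(head(z), z)  # identity at initialization
offdiag_norm = (head.weight * head.offdiag_mask).abs().sum()  # 0 at init
```

In such a setup, tracking `offdiag_norm` over training steps is one simple way to reproduce the rise-and-fall observation, since the diagonal is held fixed and all remaining dynamics live in the off-diagonal entries.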