Motivation: With a regulatory impact on numerous biological processes, protein phosphorylation is one of the most studied post-translational modifications. Effective computational methods that predict probable phosphorylation sites from sequence are desirable to guide functional experiments or to constrain search spaces in proteomics-based experimental pipelines. Currently, the most successful methods train deep learning models on amino acid composition representations. However, recently proposed protein language models provide enriched sequence representations that may capture higher-level pattern information on which more performant phosphorylation site predictors can be built.

Results: We explored the applicability of protein language models to general phosphorylation site prediction. We found that training prediction models on top of protein language model representations yields a relative improvement of 13.4% to 63.3% in area under the precision-recall curve over state-of-the-art predictors. Model interpretation and transferability experiments reveal that, across models, protease-specific cleavage patterns introduce a protease-specific training bias that leads to overly optimistic estimates of phosphorylation site prediction performance. This is an important caveat for applying advanced machine learning approaches to protein modification prediction based on proteomics data.
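The core setup described above, training a downstream classifier on per-residue protein language model representations and evaluating it by area under the precision-recall curve, can be sketched as follows. This is an illustrative toy example, not the paper's pipeline: the embeddings are synthetic stand-ins for real protein language model outputs, the embedding dimension and sample sizes are arbitrary, and the classifier is a plain logistic regression fit by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, dim = 2000, 64  # candidate S/T/Y sites, embedding size (assumed)

# Placeholder for per-residue embeddings extracted from a protein
# language model; here drawn at random with a planted linear signal.
X = rng.normal(size=(n_sites, dim))
w_true = rng.normal(size=dim)
y = (X @ w_true + 0.5 * rng.normal(size=n_sites)) > 0  # synthetic labels

# Train/test split of candidate sites.
X_tr, y_tr = X[:1500], y[:1500]
X_te, y_te = X[1500:], y[1500:]

# Logistic regression on top of the embeddings, fit by gradient descent.
w = np.zeros(dim)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_tr @ w)))
    w -= 0.1 * X_tr.T @ (p - y_tr) / len(y_tr)

# Area under the precision-recall curve (average precision):
# mean of the precision values at the rank of each true positive.
scores = X_te @ w
order = np.argsort(-scores)
y_sorted = y_te[order]
tp = np.cumsum(y_sorted)
precision = tp / np.arange(1, len(y_sorted) + 1)
auprc = precision[y_sorted.astype(bool)].mean()
print(f"AUPRC on held-out sites: {auprc:.2f}")
```

Because the synthetic labels are linearly separable up to noise, the toy classifier scores well here; the point is only the shape of the evaluation, ranking candidate sites by predicted score and summarizing precision over that ranking.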