Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. This could help them separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations arising from historical contingency. To test this hypothesis, we generate synthetic MSAs, with or without phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is indeed substantially more resilient to phylogenetic noise when using MSA Transformer than when using inferred Potts models.
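
As an illustrative sketch only (not the authors' pipeline), the central comparison can be phrased as: compute pairwise Hamming distances within an MSA and correlate them with a simple combination of column attentions. The sketch below assumes the column attentions have already been extracted as an array of shape (layers, heads, columns, sequences, sequences); that shape, the averaging scheme, and all function names are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): correlate a simple combination of
# column attentions with pairwise Hamming distances in an MSA.
# Assumes `msa` is a list of aligned sequences (equal-length strings) and
# `col_attn` is a NumPy array with assumed shape (n_layers, n_heads, L, M, M),
# where L is the alignment length and M the number of sequences.
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def hamming_matrix(msa):
    """Pairwise Hamming distances (fraction of differing columns) in an MSA."""
    arr = np.array([list(seq) for seq in msa])        # shape (M, L)
    M = arr.shape[0]
    D = np.zeros((M, M))
    for i, j in combinations(range(M), 2):
        D[i, j] = D[j, i] = np.mean(arr[i] != arr[j])
    return D

def attention_similarity(col_attn):
    """One simple, universal combination: average column attentions over
    layers, heads and alignment columns, then symmetrize."""
    S = col_attn.mean(axis=(0, 1, 2))                 # shape (M, M)
    return (S + S.T) / 2

def correlate(msa, col_attn):
    """Pearson correlation between off-diagonal attention similarities
    and Hamming distances."""
    D = hamming_matrix(msa)
    S = attention_similarity(col_attn)
    iu = np.triu_indices_from(D, k=1)                 # upper triangle, i < j
    return pearsonr(S[iu], D[iu])
```

A strong correlation returned by such a comparison is what is meant above by column attentions "encoding" phylogenetic relationships; the specific averaging over layers, heads and columns is only one of many simple combinations one could test.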