Centromeres invariably serve as the loci of kinetochore assembly in all eukaryotic cells, but their underlying DNA sequences evolve rapidly. Human centromeres are characterized by their extremely repetitive structures, i.e., higher-order repeats, rendering the region one of the most difficult parts of the genome to assess. Consequently, our understanding of centromere sequence variations across human populations is limited. Here, we analyzed chromosomes 11, 17, and X using long sequencing reads of two European and two Asian genomes, and our results show that human centromere sequences exhibit substantial structural diversity, harboring many novel variant higher-order repeats specific to individuals, while frequent single-nucleotide variants are largely conserved. Our findings add another dimension to our knowledge of centromeres, challenging the notion of stable human centromeres. The discovery of such diversity prompts further deep sequencing of human populations to understand the true nature of sequence evolution in human centromeres.
Main

Centromeres have been one of the most mysterious parts of the human genome since they were characterized, in the 1970s, as large tracts of 171 bp strings called alpha-satellite monomers 1, 2. With a growing body of evidence suggesting their relevance to human diseases as sources of genomic instability or as repositories of haplotypes containing causative mutations 3-8, it has become more important to investigate the underlying sequence variations in centromeres 9, 10.

Human centromeres have nested repeat structures. Namely, a series of distinctively divergent alpha-satellite monomers compose a larger unit called a higher-order repeat (HOR) unit, and copies of an HOR unit are tandemly arranged thousands of times to form large, homogeneous HOR arrays. While HOR units are chromosome-specific and consist of 2 to 34 alpha-satellite monomers, copies of an HOR unit are almost identical (95-100%) within a chromosome (Figure 1a) 11-17. The total HOR array length in each chromosome is known to differ dramatically among individuals 7, 18 or across human populations 19, 20. In addition to array length, other types of variation are known to exist within an HOR array, such as structurally variant HORs that consist of different numbers and/or types of alpha-satellite monomers 20-23, and single-nucleotide variations (SNVs) between the HOR units 20, 24, 25.
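The nested repeat structure described above can be illustrated with a toy simulation. The following Python sketch is only an illustration under stated assumptions and is not the analysis method of this study: the number of monomers per unit (4), the number of HOR copies (10), and the divergence rates (20% between monomers, 1% between HOR copies) are hypothetical values chosen for readability, and randomly generated sequences stand in for real alpha-satellite monomers. Only the 171 bp monomer length is taken from the text.

```python
import random
from itertools import combinations

MONOMER_LEN = 171  # alpha-satellite monomer length in bp (from the text)


def mutate(seq, rate, rng):
    """Substitute each base with probability `rate`, always to a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_toy_hor_array(n_monomers=4, n_copies=10, seed=0):
    """Build a toy HOR array; all parameter values are hypothetical."""
    rng = random.Random(seed)
    # A random ancestral monomer stands in for a real alpha-satellite sequence.
    ancestor = "".join(rng.choice("ACGT") for _ in range(MONOMER_LEN))
    # Divergent monomers: each differs from the ancestor at about 20% of sites.
    monomers = [mutate(ancestor, 0.20, rng) for _ in range(n_monomers)]
    # One HOR unit is the ordered concatenation of its monomers.
    hor_unit = "".join(monomers)
    # Tandem copies of the unit, each carrying about 1% substitutions.
    copies = [mutate(hor_unit, 0.01, rng) for _ in range(n_copies)]
    return monomers, hor_unit, copies, "".join(copies)


monomers, hor_unit, copies, hor_array = build_toy_hor_array()

# Identity between monomers within a unit (low) versus between HOR copies (high).
monomer_ids = [identity(a, b) for a, b in combinations(monomers, 2)]
copy_ids = [identity(a, b) for a, b in combinations(copies, 2)]

print(f"HOR unit length: {len(hor_unit)} bp, array length: {len(hor_array)} bp")
print(f"monomer-vs-monomer identity: {min(monomer_ids):.2f} to {max(monomer_ids):.2f}")
print(f"copy-vs-copy identity: {min(copy_ids):.2f} to {max(copy_ids):.2f}")
```

In this sketch, monomers within a unit are clearly distinguishable from one another, whereas copies of the whole unit are nearly identical, which is the property that makes HOR arrays homogeneous and hard to assemble. Substitutions are the only variation modeled; structurally variant HORs and array length differences are not represented.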
One of the major driving forces leading to such centromeric variation has been thought to be structural alterations such as unequal crossovers and/or gene conversions 26, 27.

Previous studies have investigated centromeric sequence variations via traditional approaches such as restriction enzymes sensitive to alpha-satellite monomers, Southern blotting, or analysis of k-mers unique to centromeres in short reads obtained in the 1000 Genomes Project 28, but their observations have remained indirect and were confined to specific types of variations due to technological limitations.

Recently, the advent of long-rea...