High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are only available for a few non-microbial species 1-4. To address this issue, the international Genome 10K (G10K) consortium 5,6 has worked over a five-year period to evaluate and develop cost-effective methods for assembling the most accurate and complete reference genomes to date. Here we summarize these developments, introduce a set of quality standards, and present lessons learned from sequencing and assembling 16 species representing major vertebrate lineages (mammals, birds, reptiles, amphibians, teleost fishes and cartilaginous fishes). We confirm that long-read sequencing technologies are essential for maximizing genome quality and that unresolved complex repeats and haplotype heterozygosity are major sources of error in assemblies. Our new assemblies identify and correct substantial errors in some of the best historical reference genomes. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.
After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist 1,2. The remaining gaps include ribosomal DNA (rDNA) arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 2, along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome 3, we reconstructed the ~2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes. Complete, telomere-to-telomere reference assemblies are necessary to ensure that all genomic variants, large and small, are discovered and studied.
Currently, unresolved regions of the human genome are defined by multi-megabase satellite arrays in the pericentromeric regions and the rDNA arrays on acrocentric short arms, as well as regions enriched in segmental duplications that are greater than hundreds of kilobases in length and greater than 98% identical between paralogs. Due to their absence from the reference, these repeat-rich sequences are often excluded from contemporary genetics and genomics studies, limiting the scope of association and functional analyses 4,5. Unresolved repeat sequences also result in unintended consequences such as paralogous sequence variants incorrectly called as allelic variants 6 and even the contamination of bacterial gene databases 7. Completion of the entire human genome is expected to contribute to our understanding of chromosome function 8 and human disease 9, and a comprehensive understanding of genomic variation will improve the driving technologies in biomedicine that currently use short-read mapping to a reference genome (e.g. RNA-seq 10, ChIP-seq 11, ATAC-seq 12). The fundamental challenge of reconstructing a genome from many comparatively short sequencing reads, a process known as genome assembly, is distinguishing the repeated sequences from one another 13. Resolving such r...
1-Aminocyclopropane-1-carboxylic acid (ACC) synthase is the key regulatory enzyme in the biosynthetic pathway of the plant hormone ethylene. The enzyme is encoded by a divergent multigene family in Arabidopsis thaliana, comprising at least five genes, ACS1-5 (Liang, X., Abel, S., Keller, J.A., Shen, N.F., and Theologis, A. (1992) Proc. Natl. Acad. Sci. U.S.A. 89, 11046-11050). In etiolated seedlings, ACS4 is specifically induced by indoleacetic acid (IAA). The response to IAA is rapid (within 25 min) and insensitive to protein synthesis inhibition, suggesting that ACS4 gene expression is a primary response to IAA. ACS4 mRNA accumulation displays a biphasic dose-response curve that is optimal at 10 microM IAA. However, IAA concentrations as low as 100 nM are sufficient to enhance the basal level of ACS4 mRNA. The expression of ACS4 is defective in the Arabidopsis auxin-resistant mutant lines axr1-12, axr2-1, and aux1-7. ACS4 mRNA levels are severely reduced in axr1-12 and axr2-1 but are only 1.5-fold lower in aux1-7. IAA inducibility is abolished in axr2-1. The ACS4 gene was isolated and structurally characterized. The promoter contains four sequence motifs reminiscent of functionally defined auxin-responsive cis-elements in the early auxin-inducible genes PS-IAA4/5 from pea and GH3 from soybean. Conceptual translation of the coding region predicts a protein with a molecular mass of 53,795 Da and a theoretical isoelectric point of 8.2. The ACS4 polypeptide contains the 11 invariant amino acid residues conserved between aminotransferases and ACC synthases from various plant species. An ACS4 cDNA was generated by reverse transcriptase-polymerase chain reaction, and its authenticity was confirmed by expression of ACC synthase activity in Escherichia coli.
Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 MYA, but that are absent in the Hominidae. In fact, Hominidae show between four- and seven-fold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. For example, recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli. This process resulted in thousands of novel, species-specific CTCF binding sites. Our results demonstrate that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.