The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project applies a combination of computation, manual curation, and collaboration to the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
In April 2008, a nucleotide sequence-based, complete genome classification system was developed for group A rotaviruses (RVs). This system assigns a specific genotype to each of the 11 genome segments of a particular RV strain according to established nucleotide percent cut-off values. Using this approach, the genome of each RV strain is given the complete descriptor of Gx-P[x]-Ix-Rx-Cx-Mx-Ax-Nx-Tx-Ex-Hx. A Rotavirus Classification Working Group (RCWG) was formed by scientists in the field to maintain, evaluate, and develop the RV genotype classification system, in particular to aid in the designation of new genotypes. Since its inception, the group has ratified 50 new genotypes: as of January 2011, new genotypes for VP7 (G20–G26), VP4 (P[28]–P[35]), VP6 (I12–I16), VP1 (R5–R9), VP2 (C6–C9), VP3 (M7–M8), NSP1 (A15–A16), NSP2 (N6–N9), NSP3 (T8–T12), NSP4 (E12–E14), and NSP5/6 (H7–H11) have been defined for RV strains identified in humans, cows, pigs, horses, mice, South American camelids (guanaco and vicuña), chickens, turkeys, pheasants, and bats. With increasing numbers of complete RV genome sequences becoming available, a standardized RV strain nomenclature system is needed, and the RCWG proposes that individual RV strains be named as follows: RV group/species of origin/country of identification/common name/year of identification/G- and P-type. In collaboration with the National Center for Biotechnology Information (NCBI), the RCWG is also working on developing an RV-specific resource for the deposition of nucleotide sequences. This resource will provide useful information regarding RV strains, including, but not limited to, individual gene genotypes and epidemiological and clinical information. Together, the proposed nomenclature system and the NCBI RV resource will offer highly useful tools for investigators to search for, retrieve, and analyze the ever-growing volume of RV genomic data.
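The descriptor and the proposed strain name are both fixed-order, delimiter-separated strings, so they can be illustrated with a short Python sketch. The function and class names below are hypothetical, the sample strain values are invented for illustration, and writing the final field as the G-type followed directly by the P-type (for example `G1P[8]`) is an assumption about formatting that the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import Mapping

# Segment letters in the order used by the descriptor in the text:
# Gx-P[x]-Ix-Rx-Cx-Mx-Ax-Nx-Tx-Ex-Hx
SEGMENT_ORDER = ("G", "P", "I", "R", "C", "M", "A", "N", "T", "E", "H")


def format_genotype(letter: str, number: int) -> str:
    """Render one genotype; P-types carry square brackets, as in P[8]."""
    if letter == "P":
        return f"P[{number}]"
    return f"{letter}{number}"


def genotype_descriptor(genotypes: Mapping[str, int]) -> str:
    """Join the 11 segment genotypes into one hyphen-separated descriptor."""
    missing = [s for s in SEGMENT_ORDER if s not in genotypes]
    if missing:
        raise ValueError(f"missing genotype for segment(s): {missing}")
    return "-".join(format_genotype(s, genotypes[s]) for s in SEGMENT_ORDER)


@dataclass
class StrainRecord:
    """Fields of the proposed strain name, in the order listed in the text."""
    rv_group: str
    species: str
    country: str
    common_name: str
    year: int
    g_type: int
    p_type: int

    def name(self) -> str:
        # Assumption: G- and P-type are written back to back in the last field.
        gp = format_genotype("G", self.g_type) + format_genotype("P", self.p_type)
        fields = [self.rv_group, self.species, self.country,
                  self.common_name, str(self.year), gp]
        return "/".join(fields)


# Illustrative values only; not taken from any sequence record.
example_genotypes = {s: 1 for s in SEGMENT_ORDER}
example_genotypes["P"] = 8
example_strain = StrainRecord("RVA", "Human", "USA", "Example1", 2010, 1, 8)

if __name__ == "__main__":
    print(genotype_descriptor(example_genotypes))
    print(example_strain.name())
```

Keeping the segment order in a single tuple means the descriptor cannot be assembled out of order, and a missing segment raises an error instead of silently producing a shorter string.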
Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next-generation sequence data hinge upon well-implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/; it catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well-annotated reference virus genome sets.