2020
DOI: 10.21203/rs.3.rs-54568/v1
Preprint

Comprehensive Analysis of Non Redundant Protein Database

Abstract: Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of the contents of the NR database. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: …

Cited by 7 publications (3 citation statements)
References 22 publications
“…To remove all the redundant proteins, the CD-Hit-h analysis approach was utilized. Furthermore, the non-redundant proteins were considered for further processing [46].…”
Section: Cd-hit Analysis (Cluster Data With High Identity and Tolerance) | mentioning
confidence: 99%
“…GeneMarkS software was used to predict the protein-coding genes of the bacterial genome [34]. Comparing the protein sequence of the predicted gene with the NR (Non-Redundant Protein Database) [35], COG (Cluster of Orthologous Groups of proteins), KEGG (Kyoto Encyclopedia of Genes and Genomes) [36], eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) [37], Swiss-Prot [38] and GO (Gene Ontology) [39] databases was performed by Diamond blast p. For this analysis, the comparison result with the highest score was selected for annotation and the following cutoff values were applied (e-value <1e-6 and amino acid sequence identity of at least 40 %). Orthologous average nucleotide identity (ANI) [40] and digital DNA–DNA hybridization (dDDH) [41] values were calculated using the ANI calculator tool from the EzBioCloud (www.ezbiocloud.net/tools/ani) [42] and the Genome-to-Genome Distance Calculator (http://ggdc.dsmz.de/ggdc) [43], respectively.…”
Section: Genome Features | mentioning
confidence: 99%
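
The annotation rule quoted above (keep the highest-scoring hit per query, subject to an e-value below 1e-6 and at least 40 % identity) can be sketched in a few lines. This is a minimal illustration, not the citing authors' pipeline. It assumes the hits are in the standard 12-column BLAST tabular layout and that "highest score" means bit score; the function name `best_hits` and the sample rows are hypothetical.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular format (assumed input layout).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tsv_text, max_evalue=1e-6, min_identity=40.0):
    """Return {query: (subject, pident, bitscore, evalue)} for the best passing hit.

    A hit passes when evalue < max_evalue and pident >= min_identity.
    "Best" is taken to mean the highest bit score (an assumption).
    """
    best = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        pident = float(hit["pident"])
        bitscore = float(hit["bitscore"])
        if evalue >= max_evalue or pident < min_identity:
            continue  # fails one of the quoted cutoffs
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], pident, bitscore, evalue)
    return best


# Hypothetical example rows: g1 has two passing hits, g2 fails identity, g3 fails e-value.
sample = "\n".join([
    "g1\tP1\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-30\t200",
    "g1\tP2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t250",
    "g2\tP3\t35.0\t100\t65\t0\t1\t100\t1\t100\t1e-10\t80",
    "g3\tP4\t55.0\t100\t45\t0\t1\t100\t1\t100\t1e-3\t40",
])
annotations = best_hits(sample)
```

Queries with no passing hit are simply absent from the result, so an unannotated gene is distinguishable from one annotated with a weak match.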
“…Protein sequence information is easily deduced from DNA using the genetic code. For most discovered proteins this type of information is known: the most comprehensive non-redundant protein sequence database currently contains 174 million sequences [ 16 ] and is doubling in size every 28 months [ 17 ]. However, determining protein structure is a complex process and thus there are considerably fewer structures available—approximately 175,000 structures in the Protein Data Bank [ 18 ].…”
Section: Literature Review | mentioning
confidence: 99%
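
To make the quoted growth figure concrete, the sketch below extrapolates from 174 million sequences with a 28-month doubling time. It assumes the doubling time stays constant, which the quoted statement does not guarantee; the function names are hypothetical.

```python
import math


def projected_size(n0, months, doubling_months=28.0):
    """Size after `months`, assuming constant exponential growth:
    N(t) = n0 * 2 ** (t / doubling_months)."""
    return n0 * 2 ** (months / doubling_months)


def months_to_reach(n0, target, doubling_months=28.0):
    """Months until the database grows from n0 to target under the same assumption."""
    return doubling_months * math.log2(target / n0)


start = 174e6  # sequence count quoted in the citation statement
after_one_doubling = projected_size(start, 28)    # 348 million
after_two_doublings = projected_size(start, 56)   # 696 million
months_to_billion = months_to_reach(start, 1e9)   # a bit over 70 months
```

Under this assumption the database would pass one billion sequences in roughly six years from the quoted count.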