2022
DOI: 10.1093/nar/gkac1052

UniProt: the Universal Protein Knowledgebase in 2023

Abstract: The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of sequences in UniProtKB has risen to over 227 million and we are working towards including a reference proteome for each taxonomic group. We continue to extract detailed a…

Cited by 3,143 publications (1,227 citation statements)
References: 40 publications
“…To test whether the language model’s understanding of proteins generalizes from natural to de novo space, it is critical that the model did not see de novo proteins at train time. To this end, we first remove all sequences from ESM2’s train set labeled as “artificial construct” on the UniProt (57) website, when 2021_04 was the most recent release (1,027 total proteins). To guard against mislabeled proteins, and to further remove sequences in the train set which may bear similarity to the target set, we additionally perform Jackhmmer (58) searches of each de novo sequence against UniRef100 with flags 1 {, and remove all hits returned by the tool from ESM2’s training set (58,462 proteins).…”
Section: Methods (mentioning)
confidence: 99%
“…In the case of Fig. 4D, each of the 25k free generations and the ≈15k natural proteins from (59) was queried against the sequences in AlphaFold DB (37), which comprise UniProt (57). Because all sequences in this database have a structure predicted by AlphaFold, searching against this database enables comparison of predicted structure at scale.…”
Section: Methods (mentioning)
confidence: 99%
“…ProteinMPNN was trained on static protein crystal structures in the PDB. ProtXLNet was trained on the UniRef100 database (Consortium, 2022). Both of these models capture the joint probability of protein sequences and structures as selected for by nature.…”
Section: Discussion (mentioning)
confidence: 99%
“…Other classes of proteins with special structural properties are covered by the new Amylograph (29) which curates information on amyloid-amyloid interactions and the returning PhaSepDB (30) for proteins that can participate in phase separation, now doubled in size and with much more detailed annotations. Other notable updating databases include the Biological Magnetic Resonance Data Bank (31); the eggNOG resource for comparative genomics (32) which more than doubles the number of species covered; the InterPro protein family compilation (33) which benefits from an improved interface that includes features inspired by the now-retiring Pfam website (34); and UniProt (35) which also has a redesigned website. The UniProt paper features interesting updates on the parallel and complementary annotation activities centred respectively on community curation and automatic rules-based methods.…”
Section: New and Updated Databases (mentioning)
confidence: 99%