2021
DOI: 10.1107/s2052252521011088

findMySequence: a neural-network-based approach for identification of unknown proteins in X-ray crystallography and cryo-EM

Abstract: Although experimental protein-structure determination usually targets known proteins, chains of unknown sequence are often encountered. They can be purified from natural sources, appear as an unexpected fragment of a well characterized protein or appear as a contaminant. Regardless of the source of the problem, the unknown protein always requires characterization. Here, an automated pipeline is presented for the identification of protein sequences from cryo-EM reconstructions and crystallographic data. The met…

Citations: cited by 46 publications (64 citation statements)
References: 68 publications
“…To identify these unknown fibrils, MS-based proteomics of fractionated tissue from FTLD-TDP case 1 (Table S5) was used to identify 600 potential protein candidates. The well-resolved sidechain density of the doublet fibril was mapped to this reduced proteome using cryoID (Ho et al, 2020) and findMySequence (Chojnowski et al, 2021) software, which unambiguously identified an amino acid sequence corresponding to residues 120-254 of TMEM106B (see STAR Methods). Fibrils formed by TMEM106B(120-154) exhibit ultrastructural polymorphism with a common protofilament (Figure 4A) existing as either a singlet fibril (Figure 5A and 5C) or in a 2-fold symmetrical juxtaposition as a doublet fibril (Figure 3A).…”
Section: Results (mentioning)
confidence: 99%
“…Tryptic digest-mass spectrometry of a fibril sample extracted from FTLD-TDP type A case 1 identified 600 proteins which we used to help identify the fibril's constituent protein. By cross-referencing the proteins in the sample identified by mass spectrometry to the 2.7 Å density map using cryo-ID (Ho et al, 2020) and findMySequence (Chojnowski et al, 2021), we unambiguously determined that the fibril was composed of TMEM106B(120-254). TMEM106B(120-254) sidechains were manually added to the 135 amino acid Cα chain and iteratively improved using real-space refinement in COOT (Emsley et al, 2010).…”
Section: Model Building (mentioning)
confidence: 99%
“…First, to characterize the sequence of a certain map de novo, based on the modeled AlphaFold2 main chain and subsequent search in sequence databases, we have shown that sufficient resolution to optimally fit the full sequence is not a necessary prerequisite, expanding the field of de novo sequence identification to resolutions as low as ~4.6 Å. Previously, de novo polypeptide chains were modeled in overexpressed or highly purified endogenous species (Ho et al, 2020; Su et al, 2021; Chojnowski et al, 2022), while our work suggests that protein community members can be modeled de novo at near-atomic resolutions despite their substantial complexity and inherent flexibility. Second, initial models for refinement in cryo-EM maps belonging to community members from native cell extracts can be used and then optimized in the experimental densities of protein community members.…”
Section: Discussion (mentioning)
confidence: 98%
“…Resulting models were fitted into the EM map and refined in real space. The model's backbones and signature's map were then used as input to the findMySequence (Chojnowski et al, 2022) program to identify corresponding sequences in the C. thermophilum proteome, not accounting for the MS data. In both cases, the program selected the E2o sequence, achieving a significantly better score for the corresponding E2o model (E-value 67.9e-30 versus 1.7e-3 for E2b).…”
Section: De Novo Protein Sequence Identification From Reported Cryo-EM Maps Is Feasible (mentioning)
confidence: 99%