CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking

Dhakal, Ashwin; Gyawali, Rajan; Wang, Liguo; Cheng, Jianlin

doi:10.1101/2023.02.21.529443

Cited by 7 publications

(13 citation statements)

References 59 publications

“…After CryoSegNet was trained and validated on the training/validation, we blindly benchmarked it on a test dataset consisting of thousands of labeled cryo-EM micrographs of 7 different protein types from the CryoPPP 4 dataset. The particles picked by CryoSegNet were compared with the ground truth coordinates of the expert-labeled particles.…”

Section: Results (mentioning)

confidence: 99%

“…Thus, it is imperative to determine the protein structure for understanding protein function and interaction, studying their roles in the diseases, and accelerating the design of drugs. X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-EM 4,5 are three main experimental techniques to determine protein structures. Among them, cryo-EM is the cutting-edge technique for solving the structure of large protein complexes.…”

Section: Introduction (mentioning)

confidence: 99%

“…With advancements in hardware and software tools [8][9][10][11][12] , numerous semi-automated or automated approaches varying from traditional computational methods to modern deep learning techniques have been proposed to streamline the cryo-EM processing and particle picking. Conventional computer vision methods like edge detection, blob detection and template matching 4 are still widely used for particle picking. However, due to the low SNR of cryo-EM micrographs, these techniques are susceptible to picking ice patches, carbon areas and aggregated particles, resulting in a high number of false positives.…”

Section: Introduction (mentioning)

confidence: 99%

“…Fig 4. Comparison results for viewing direction, resolution, and 3D density map of particles picked by crYOLO, Topaz and CryoSegNet.…”

mentioning

confidence: 99%

Accurate cryo-EM protein particle picking by integrating the foundational AI image segmentation model and specialized U-Net

Gyawali, Dhakal, Wang et al. 2023

Preprint

Self Cite

Cryo-electron microscopy (cryo-EM) has revolutionized the field of structural biology by enabling the precise determination of large protein structures. Picking protein particles in cryo-EM micrographs (images) is a crucial step in the cryo-EM-based structure determination. However, existing methods trained on a limited amount of cryo-EM data still cannot accurately pick protein particles from complex, noisy, and heterogenous cryo-EM images. The general foundational artificial intelligence (AI)-based image segmentation model such as the Segment Anything Model (SAM) trained on huge amounts of general image data cannot segment protein particles well because their training data do not include cryo-EM images. In this work, we present a novel approach (CryoSegNet) of integrating the power of the encoder and decoder-based architecture of an attention-gated U-shape network (U-Net) specially designed and trained for cryo-EM particle picking and the SAM. The U-Net is first trained on a large cryo-EM image dataset and then used to generate input from original cryo-EM images for SAM to make particle pickings. CryoSegNet shows both high precision and recall in segmenting protein particles from cryo-EM micrographs, irrespective of protein type, shape, and size. On several independent datasets of various protein types, CryoSegNet outperforms two top machine learning particle pickers crYOLO and Topaz as well as SAM itself. The average resolution of density maps reconstructed from the particles picked by CryoSegNet is 3.05 Å, 15% better than 3.60 Å of Topaz and 49% better than 5.96 Å of crYOLO. Therefore, CryoSegNet can be applied to enhance the resolution of protein structures constructed from both existing and new cryo-EM data.
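The percentage gains quoted in this abstract can be reproduced from the reported resolutions. The sketch below assumes the percentages are relative reductions in the resolution value in Å, where a smaller value is better; the function name `relative_improvement` is illustrative and does not come from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percent reduction of `new` relative to `baseline` (smaller Å value is better)."""
    return 100.0 * (baseline - new) / baseline


# Average resolutions in Å as reported in the preprint abstract above.
cryosegnet = 3.05
others = {"Topaz": 3.60, "crYOLO": 5.96}

for name, resolution in others.items():
    gain = relative_improvement(resolution, cryosegnet)
    print(f"CryoSegNet vs {name}: {gain:.0f}% better")
```

Under that assumption the rounded results agree with the 15% and 49% figures stated in the abstract.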

“…Predicting rare GO terms is analogous to the few-shot learning problems [6] in various domains like computer vision[7, 8, 9], and natural language processing(NLP). For example, in the classification task of named entity typing[10, 11] in NLP, assigning rare entity types to entity names pose a similar challenge, due to the increasing size and granularity of entity types.…”

Section: Introduction (mentioning)

confidence: 99%

Improving protein function prediction by learning and integrating representations of protein sequences and function labels

Boadu, Cheng 2024

Preprint

Self Cite

Motivation: As fewer than 1% of proteins have protein function information determined experimentally, computationally predicting the function of proteins is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made in protein function prediction by the community in the last decade, the general accuracy of protein function prediction is still not high, particularly for rare function terms associated with few proteins in the protein function annotation database such as the UniProt.

Results: We introduce TransFew, a new transformer model, to learn the representations of both protein sequences and function labels (Gene Ontology (GO) terms) to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function relevant representations of proteins from raw protein sequences and uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definition and hierarchical relationships, which are combined together to predict protein function via the cross-attention. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy over the existing methods, but substantially improves the accuracy of predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms.

Availability: https://github.com/BioinfoMachineLearning/TransFew

Supplementary information: Supplementary data are available.

CryoSegNet: accurate cryo-EM protein particle picking by integrating the foundational AI image segmentation model and attention-gated U-Net

Gyawali, Dhakal, Wang et al. 2024

Briefings in Bioinformatics

Picking protein particles in cryo-electron microscopy (cryo-EM) micrographs is a crucial step in the cryo-EM-based structure determination. However, existing methods trained on a limited amount of cryo-EM data still cannot accurately pick protein particles from noisy cryo-EM images. The general foundational artificial intelligence–based image segmentation model such as Meta’s Segment Anything Model (SAM) cannot segment protein particles well because their training data do not include cryo-EM images. Here, we present a novel approach (CryoSegNet) of integrating an attention-gated U-shape network (U-Net) specially designed and trained for cryo-EM particle picking and the SAM. The U-Net is first trained on a large cryo-EM image dataset and then used to generate input from original cryo-EM images for SAM to make particle pickings. CryoSegNet shows both high precision and recall in segmenting protein particles from cryo-EM micrographs, irrespective of protein type, shape and size. On several independent datasets of various protein types, CryoSegNet outperforms two top machine learning particle pickers crYOLO and Topaz as well as SAM itself. The average resolution of density maps reconstructed from the particles picked by CryoSegNet is 3.33 Å, 7% better than 3.58 Å of Topaz and 14% better than 3.87 Å of crYOLO. It is publicly available at https://github.com/jianlin-cheng/CryoSegNet
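The abstract reports precision and recall but does not say how picked particles are matched to expert labels. As a minimal sketch only, the code below assumes a greedy nearest-neighbor match within a pixel distance threshold; the function name, the threshold and the coordinates are hypothetical and are not taken from CryoSegNet or CryoPPP.

```python
import math


def precision_recall(picked, truth, max_dist):
    """Greedily match picked (x, y) centers to unused ground-truth centers.

    A pick counts as a true positive if an unmatched ground-truth center
    lies within `max_dist` pixels; each ground-truth center is used once.
    """
    unused = list(truth)
    true_pos = 0
    for px, py in picked:
        best_idx, best_dist = None, max_dist
        for i, (tx, ty) in enumerate(unused):
            dist = math.hypot(px - tx, py - ty)
            if dist <= best_dist:
                best_idx, best_dist = i, dist
        if best_idx is not None:
            unused.pop(best_idx)  # consume the matched label
            true_pos += 1
    precision = true_pos / len(picked) if picked else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical coordinates for one micrograph (not real data).
picked = [(10, 10), (50, 50), (200, 200)]
truth = [(12, 11), (49, 52), (100, 100), (150, 150)]
p, r = precision_recall(picked, truth, max_dist=5.0)
print(f"precision={p:.2f} recall={r:.2f}")
```

Greedy matching is order-dependent; an optimal assignment would be a natural alternative when particles are densely packed.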
