Prokaryotic virus Host Predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics

Lu, Congyu; Zhang, Zheng; Cai, Zena; Zhu, Zhaozhong; Qiu, Ye; Wu, Aiping; Jiang, Taijiao; Zheng, Heping; Peng, Yousong

doi:10.1101/2020.12.02.408310

Cited by 8 publications

(21 citation statements)

References 28 publications

(25 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In this section, we will show our experimental results on different datasets and compare CHERRY against the stateof-the-art tools: WIsH [10], PHP [24], VHM-Net [12], VPF-Class [13], vHULK [19], RaFAH [21], and HostG [27].…”

Section: Resultsmentioning

confidence: 99%

“…They also use a deep learning model and finally integrate the results of deep learning model with the Markov model for host prediction. On the other hand, PHP [24] utilizes the k-mer frequency, which can reflect the codon usage patterns shared by the viruses and the hosts [25,26]. HostG [27] utilizes the shared protein clusters between viruses and prokaryotes to create a knowledge graph and trains a graph convolutional network for prediction.…”

Section: Related Workmentioning

confidence: 99%

“…Dataset: we followed [24] and used the same virus-host relationship benchmark dataset for training (the VHM dataset) and testing (the TEST dataset). The detailed information is shown in Fig.…”

Section: Identifying Viruses That Infect Pathogenic Bacteriamentioning

confidence: 99%

See 2 more Smart Citations

CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Shang,

Sun

2022

Preprint

View full text Add to dashboard Cite

Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Although there are experimental methods for host identification, they are either labor-intensive or require the cultivation of the host cells, creating a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited known interactions and the sheer amount of sequenced phages by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43% accuracy at the species level. This work presents CHERRY, a tool formulating host prediction as link prediction in a knowledge graph. As a virus-prokaryotic interaction prediction tool, CHERRY can be applied to predict hosts for newly discovered viruses and also the viruses infecting antibiotic-resistant bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with the state-of-the-art methods in different scenarios. To our best knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37%. In addition, CHERRY's performance is more stable on short contigs than other tools.

show abstract

Section: Resultsmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

See 1 more Smart Citation

CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Shang,

Sun

2022

Preprint

View full text Add to dashboard Cite

show abstract

“…Several attempts have been made to predict hosts for phages based on the genomic sequences. They can be roughly divided into two groups: alignment-based [15,16] and learning-based [10,17,18] models. Alignment-based methods utilize sequence similarity search between query contigs and reference genomes of candidate hosts (bacteria).…”

Section: Related Workmentioning

confidence: 99%

Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning

Shang,

Sun

2021

Preprint

View full text Add to dashboard Cite

Motivation: Bacteriophages (aka phages) are viruses that infect bacteria and archaea. Thus, they play important regulatory roles in natural and host-associated ecosystems. As the most abundant and diverse biological entities in the biosphere, phages have received increased attention in their research and applications. In particular, identifying their hosts provides key knowledge for their usages as antibiotics. High-throughput sequencing and its application to the microbiome have offered new opportunities for phage host detection. However, there are two main challenges for computational host prediction. First, the known phage-host relationships are very limited compared to sequenced phages. Second, although the sequence similarity between phages and bacteria has been used as a major feature for host prediction, the alignment is either missing or ambiguous for accurate host prediction. Thus, there is still a need to improve the accuracy of host prediction. Results: In this work, we present a semi-supervised learning model, named HostG, to conduct host prediction for novel phages. We construct a knowledge graph by utilizing both phage-phage protein similarity and phage-host DNA sequence similarity. Then graph convolutional network (GCN) is adopted to exploit phages with or without known hosts in training to enhance the learning ability. During the GCN training, we minimize the expected calibrated error (ECE) to ensure the confidence of the predictions. We tested HostG on both simulated and real sequencing data and the results demonstrated that it competes favorably against the state-of-the-art pipelines.

show abstract

“…First, both lytic and temperate phages can integrate the host genetic materials into their genomes, leading to local sequence similarities between the genomes of phages and bacteria [13]. For example, ~76% phages with known hosts in the RefSeq database have detectable alignments (E-value <1e-5) with their host genomes [14]. These common regions will pose challenges for distinguishing phages from their bacterial hosts.…”

Section: Introductionmentioning

confidence: 99%

Accurate identification of bacteriophages from metagenomic data using Transformer

Shang,

Tang,

Guo

et al. 2022

Preprint

View full text Add to dashboard Cite

Motivation: Bacteriophages are viruses infecting bacteria. Being key players in microbial communities, they can regulate the composition/function of microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic materials from various microbiome, has become a popular means for new phage discovery. However, accurate and comprehensive detection of phages from the metagenomic data remains difficult. High diversity/abundance, and limited reference genomes pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or precision on metagenomic data. Results: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins' positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predicts the label for test contigs. We rigorously tested our developed tool named PhaMer on multiple datasets with increasing difficulty, including quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data, and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.

show abstract

Prokaryotic virus Host Predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics

Cited by 8 publications

References 28 publications

CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning

Accurate identification of bacteriophages from metagenomic data using Transformer

Contact Info

Product

Resources

About