Background: Identifying splice sites is a necessary step to analyze the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent on splice sites, and many other patterns are also on splice sites with important biological functions. Meanwhile, the dinucleotides occur frequently at the sequences without splice sites, which makes the prediction prone to generate false positives. Most existing tools select all the sequences with the two dimers and then focus on distinguishing the true splice sites from those pseudo ones. Such an approach will lead to a decrease in false positives; however, it will result in non-canonical splice sites missing. Result: We have designed SpliceFinder based on convolutional neural network (CNN) to predict splice sites. To achieve the ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data unbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains the classification accuracy of 90.25%, which is 10% higher than the existing algorithms. The method outperforms other existing methods in terms of area under receiver operating characteristics (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder generates results in about half lower false positive while keeping recall higher than 0.8. Also, SpliceFinder captures the non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. Conclusion: Based on CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates less false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics. org/wangruohan/SpliceFinder.
Next-generation sequencing expands the known phage genomes rapidly. Unlike culture-based methods, the hosts of phages discovered from next-generation sequencing data remain uncharacterized. The high diversity of the phage genomes makes the host assignment task challenging. To solve the issue, we proposed a phage host prediction tool—DeepHost. To encode the phage genomes into matrices, we design a genome encoding method that applied various spaced $k$-mer pairs to tolerate sequence variations, including insertion, deletions, and mutations. DeepHost applies a convolutional neural network to predict host taxonomies. DeepHost achieves the prediction accuracy of 96.05% at the genus level (72 taxonomies) and 90.78% at the species level (118 taxonomies), which outperforms the existing phage host prediction tools by 10.16–30.48% and achieves comparable results to BLAST. For the genomes without hits in BLAST, DeepHost obtains the accuracy of 38.00% at the genus level and 26.47% at the species level, making it suitable for genomes of less homologous sequences with the existing datasets. DeepHost is alignment-free, and it is faster than BLAST, especially for large datasets. DeepHost is available at https://github.com/deepomicslab/DeepHost.
The advances of single-cell DNA sequencing (scDNA-seq) enable us to characterize the genetic heterogeneity of cancer cells. However, the high noise and low coverage of scDNA-seq impede the estimation of copy number variations (CNVs). In addition, existing tools suffer from intensive execution time and often fail on large datasets. Here, we propose SeCNV, an efficient method that leverages structural entropy, to profile the copy numbers. SeCNV adopts a local Gaussian kernel to construct a matrix, depth congruent map (DCM), capturing the similarities between any two bins along the genome. Then, SeCNV partitions the genome into segments by minimizing the structural entropy from the DCM. With the partition, SeCNV estimates the copy numbers within each segment for cells. We simulate nine datasets with various breakpoint distributions and amplitudes of noise to benchmark SeCNV. SeCNV achieves a robust performance, i.e. the F1-scores are higher than 0.95 for breakpoint detections, significantly outperforming state-of-the-art methods. SeCNV successfully processes large datasets (>50 000 cells) within 4 min, while other tools fail to finish within the time limit, i.e. 120 h. We apply SeCNV to single-nucleus sequencing datasets from two breast cancer patients and acoustic cell tagmentation sequencing datasets from eight breast cancer patients. SeCNV successfully reproduces the distinct subclones and infers tumor heterogeneity. SeCNV is available at https://github.com/deepomicslab/SeCNV.
Temperate phages (active prophages induced from bacteria) help control pathogenicity, modulate community structure, and maintain gut homeostasis. Complete phage genome sequences are indispensable for understanding phage biology. Traditional plaque techniques are inapplicable to temperate phages due to their lysogenicity, curbing their identification and characterization. Existing bioinformatics tools for prophage prediction usually fail to detect accurate and complete temperate phage genomes. This study proposes a novel computational temperate phage detection method (TemPhD) mining both the integrated active prophages and their spontaneously induced forms (temperate phages) from next-generation sequencing raw data. Applying the method to the available dataset resulted in 192 326 complete temperate phage genomes with different host species, expanding the existing number of complete temperate phage genomes by more than 100-fold. The wet-lab experiments demonstrated that TemPhD can accurately determine the complete genome sequences of the temperate phages, with exact flanking sites, outperforming other state-of-the-art prophage prediction methods. Our analysis indicates that temperate phages are likely to function in the microbial evolution by (i) cross-infecting different bacterial host species; (ii) transferring antibiotic resistance and virulence genes and (iii) interacting with hosts through restriction-modification and CRISPR/anti-CRISPR systems. This work provides a comprehensively complete temperate phage genome database and relevant information, which can serve as a valuable resource for phage research.
Single-cell sequencing technology enables the simultaneous capture of multiomic data from multiple cells. The captured data can be represented by tensors, i.e. the higher-rank matrices. However, the existing analysis tools often take the data as a collection of two-order matrices, renouncing the correspondences among the features. Consequently, we propose a probabilistic tensor decomposition framework, SCOIT, to extract embeddings from single-cell multiomic data. SCOIT incorporates various distributions, including Gaussian, Poisson, and negative binomial distributions, to deal with sparse, noisy, and heterogeneous single-cell data. Our framework can decompose a multiomic tensor into a cell embedding matrix, a gene embedding matrix, and an omic embedding matrix, allowing for various downstream analyses. We applied SCOIT to eight single-cell multiomic datasets from different sequencing protocols. With cell embeddings, SCOIT achieves superior performance for cell clustering compared to nine state-of-the-art tools under various metrics, demonstrating its ability to dissect cellular heterogeneity. With the gene embeddings, SCOIT enables cross-omics gene expression analysis and integrative gene regulatory network study. Furthermore, the embeddings allow cross-omics imputation simultaneously, outperforming current imputation methods with the Pearson correlation coefficient increased by 3.38–39.26%; moreover, SCOIT accommodates the scenario that subsets of the cells are with merely one omic profile available.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.