Selection of key sequence-based features for prediction of essential genes in 31 diverse bacterial species

Liu, Xiao; Wang, Bao-Jin; Luo, Xu; Tang, Hexiao; Xu, Guanghui

doi:10.1371/journal.pone.0174638

Cited by 28 publications

(49 citation statements)

References 30 publications

(26 reference statements)


“…We extracted features from gene nucleotide sequences and protein sequences. Several features derived from sequence data have been validated their usefulness in predicting gene essentiality in model organisms [10,16]. In this paper, we used the following sequence derived features: codon frequency, maximum relative synonymous codon usage (RSCUmax), codon adaptation index (CAI), gene length, GC content, amino acid frequency, and protein sequence length.…”

Section: Features Derived From Sequence Data

mentioning

confidence: 99%
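
As a rough illustration of the simpler sequence-derived features named in the statement above (gene length, GC content, amino acid frequency, protein sequence length), a minimal Python sketch follows. The function name `basic_sequence_features` and the returned keys are hypothetical and are not taken from DeepHE. Codon-based features such as CAI and RSCUmax are left out because they need a reference codon-usage table that the statement does not specify.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def basic_sequence_features(dna, protein):
    """Return a dict of simple per-gene features (illustrative only)."""
    dna = dna.upper()
    protein = protein.upper()

    # GC content: fraction of nucleotides that are G or C.
    gc = sum(1 for base in dna if base in "GC")
    aa_counts = Counter(protein)

    features = {
        "gene_length": len(dna),
        "gc_content": gc / len(dna) if dna else 0.0,
        "protein_length": len(protein),
    }
    # Amino acid frequency: count of each residue divided by protein length.
    for aa in AMINO_ACIDS:
        key = "aa_freq_" + aa
        features[key] = aa_counts[aa] / len(protein) if protein else 0.0
    return features


if __name__ == "__main__":
    feats = basic_sequence_features("ATGGCCATGTAA", "MAM")
    print(feats["gene_length"], feats["gc_content"], feats["aa_freq_M"])
```

Returning a flat dictionary keeps each feature named, which makes it easy to concatenate with other feature groups before training a classifier.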

DeepHE: Accurately Predicting Human Essential Genes based on Deep Learning

Zhang

Xiao

2020

Preprint


Motivation: Accurately predicting essential genes using computational methods can greatly reduce the time and resources spent finding them via wet-lab experiments, and can further accelerate drug discovery. Several computational methods have been proposed for predicting essential genes in model organisms by integrating multiple biological data sources, either via centrality measures or via machine learning based methods. However, methods aiming to predict human essential genes are still limited, and their performance still needs improvement. In addition, most machine learning based essential gene prediction methods lack the means to handle the imbalanced learning issue inherent in the essential gene prediction problem, which might be one factor affecting their performance.

Results: We proposed a deep learning based method, DeepHE, to predict human essential genes by integrating features derived from sequence data and the protein-protein interaction (PPI) network. A deep learning based network embedding method was utilized to automatically learn features from the PPI network. In addition, 89 sequence features were derived from the DNA sequence and protein sequence of each gene. These two types of features were integrated to train a multilayer neural network. A cost-sensitive technique was used to address the imbalanced learning problem when training the deep neural network. The experimental results for predicting human essential genes showed that our proposed method, DeepHE, can accurately predict human gene essentiality with an average AUC higher than 94%, an area under the precision-recall curve (AP) higher than 90%, and an accuracy higher than 90%. We also compared DeepHE with several widely used traditional machine learning models (SVM, Naïve Bayes, Random Forest, Adaboost). The experimental results showed that DeepHE greatly outperformed the compared machine learning models.

Conclusions: We demonstrated that human essential genes can be accurately predicted by designing an effective machine learning algorithm and integrating representative features captured from available biological data. The proposed deep learning framework is effective for such a task.

Essential genes are a subset of genes which are indispensable to the survival or reproduction of a living organism. The prediction of gene essentiality is very important for understanding the minimal requirements of an organism, identifying disease genes, and finding new drug targets. The discovery of essential genes via wet-lab experimental methods is often time-consuming, laborious, and costly. With the accumulation of gene essentiality data in some model organisms and human cell lines, many computational methods have been proposed to predict essential genes by exploring the correlations between gene essentiality and all sorts of biological information.

One focus in this direction is network based centrality measures. Many studies have demonstrated that highly connected proteins in a protein-protein interaction (PPI) network are more likely to be esse...
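
The abstract above mentions a cost-sensitive technique for the imbalanced learning problem but does not say which one. One common choice, shown here purely as an assumption and not as DeepHE's actual method, is to weight each class inversely to its frequency and use those weights in the loss. The helper names `balanced_class_weights` and `weighted_log_loss` are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight.

    labels: 0/1 ground truth; probs: predicted probability of class 1.
    """
    loss_sum = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        w = weights[y]
        loss_sum += -w * (math.log(p) if y == 1 else math.log(1.0 - p))
        weight_sum += w
    return loss_sum / weight_sum


if __name__ == "__main__":
    y = [1, 0, 0, 0]  # one essential gene, three non-essential
    w = balanced_class_weights(y)
    print(w, weighted_log_loss(y, [0.5, 0.5, 0.5, 0.5], w))
```

With these weights, a misclassified minority-class example (here, an essential gene) costs more than a misclassified majority-class example, so the two classes contribute equally to the total loss.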


“…Our experiments showed that DeeplyEssential has better predictive performance both on down-sampled and clustered datasets. On the down-sampled dataset used in [23], DeeplyEssential showed an improvement of 12.8% in AUC compared to [23] and achieved a slightly better AUC on the network-based feature model [2]. In addition, DeeplyEssential produced significantly better sensitivity and precision than the three methods in Table 5, achieving 6.2% improved sensitivity and 137.4% improved precision compare to [2].…”

Section: Comparison With Methods That Address Orthologous Genes

mentioning

confidence: 93%

“…With the introduction of large gene database such as DEG, CEG and OGEE [4, 25, 40], researchers designed more complex prediction models using a wider set of features. These features can be broadly categorized into (i) sequence features, i.e., codon frequency, GC content, gene length [29, 35, 42], (ii) topological features, i.e., degree centrality, cluster coefficient [1, 6, 24, 31], and (iii) functional features, i.e., homology, gene expression, cellular localization, functional domain and molecular properties [5,9,23,30,39]. Sequence based features can be directly obtained from the primary DNA sequence of a gene and its corresponding protein sequence. Functional features such as network topology requires knowledge of protein-protein interaction network, e.g., STRING and HumanNET [15,37].…”

mentioning

confidence: 99%

“…In order to improve the performance of our classifier, we balanced the dataset by downsampling non-essential genes. Codon frequency has been recognized an important feature for gene essentiality prediction [23,30]. Given the primary DNA sequence of a gene, its codon frequency is computed by sliding a window of three nucleotides along the gene.…”

mentioning

confidence: 99%
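
The codon-frequency computation described in the statement above can be sketched in a few lines of Python. The statement does not say whether the three-nucleotide window advances one base or one codon at a time, so the `step` parameter (default 3, i.e. in-frame codons) is an assumption, as is the function name `codon_frequency`. Windows containing characters other than A, C, G, T are skipped.

```python
from collections import Counter
from itertools import product

# All 64 possible codons over the DNA alphabet, in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequency(seq, step=3):
    """Return {codon: relative frequency} for a DNA sequence.

    step=3 reads non-overlapping in-frame codons (assumed default);
    step=1 slides the window one nucleotide at a time.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(0, len(seq) - 2, step):
        codon = seq[i:i + 3]
        if codon in CODON_SET:  # ignore windows with ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CODONS}


if __name__ == "__main__":
    freq = codon_frequency("ATGGCCATGTAA")
    print(freq["ATG"], freq["GCC"], freq["TAA"])
```

The result is a fixed-length vector of 64 values per gene, which is the shape a downstream classifier would typically expect.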

“…The random selection was repeated ten times, i.e., a ten-fold cross-validation was performed to complete the inference. The tools described in [23], [30], [29] and [28] are currently unavailable. We ran DeeplyEssential on the datasets used in the corresponding papers, and compared DeeplyEssential's classification metrics to the published metrics.…”

mentioning

confidence: 99%


DeeplyEssential: A Deep Neural Network for Predicting Essential Genes in Microbes

Hasan

Lonardi

2019

Preprint


Essential genes are genes that are critical for the survival of an organism. The prediction of essential genes in bacteria can provide targets for the design of novel antibiotic compounds or antimicrobial strategies. Here we propose a deep neural network (DNN) for predicting essential genes in microbes. Our DNN-based architecture, called DeeplyEssential, makes minimal assumptions about the input data (i.e., it only uses the gene primary sequence and the corresponding protein sequence) to carry out the prediction, thus maximizing its practical application compared to existing predictors that require structural or topological features which might not be readily available. Our extensive experimental results show that DeeplyEssential outperforms existing classifiers that either employ down-sampling to balance the training set or use clustering to exclude multiple copies of orthologous genes. We also expose and study a hidden performance bias that affected previous classifiers. The code of DeeplyEssential is freely available at https://github.com/ucrbioinfo/DeeplyEssential

1 Introduction

Essential genes are those genes that are critical for the survival and reproduction of an organism [17]. Since the disruption of essential genes induces the death of an organism, the identification of essential genes can provide targets for new antimicrobial/antibiotic drugs [7, 13]. The set of essential genes is also critical for the creation of artificial self-sustainable living cells with a minimal genome [16]. Essential genes have also been a cornerstone in understanding the origin and evolution of organisms [18].

The identification of essential genes via wet-lab experiments is labor intensive, expensive and time consuming. Such experimental procedures include single gene knock-out [3, 12], RNA interference and transposon mutagenesis [8, 32]. Moreover, these experimental approaches can produce contradicting results [23].

With the recent advances in high-throughput sequencing technology, computational methods for predicting essential genes have become a reality. Some of the early prediction methods used comparative approaches by homology mapping, see, e.g., [27, 43]. With the introduction of large gene databases such as DEG, CEG and OGEE [4, 25, 40], researchers designed more complex prediction models using a wider set of features. These features can be broadly categorized into (i) sequence features, i.e., codon frequency, GC content, gene length [29, 35, 42], (ii) topological features, i.e., degree centrality, cluster coefficient [1, 6, 24, 31], and (iii) functional features, i.e., homology, gene expression, cellular localization, functional domain and molecular properties [5,9,23,30,39].

Sequence based features can be directly obtained from the primary DNA sequence of a gene and its corresponding protein sequence. Functional features such as network topology require knowledge of a protein-protein interaction network, e.g., STRING and HumanNET [15,37]. Gene expression and functional dom...


Recent advances in the characterization of essential genes and development of a database of essential genes

Liang,

Luo,

Lin

et al. 2024

iMeta


Over the past few decades, there has been a significant interest in the study of essential genes, which are crucial for the survival of an organism under specific environmental conditions and thus have practical applications in the fields of synthetic biology and medicine. An increasing amount of experimental data on essential genes has been obtained with the continuous development of technological methods. Meanwhile, various computational prediction methods, related databases and web servers have emerged accordingly. To facilitate the study of essential genes, we have established a database of essential genes (DEG), which has become popular with continuous updates to facilitate essential gene feature analysis and prediction, drug and vaccine development, as well as artificial genome design and construction. In this article, we summarized the studies of essential genes, overviewed the relevant databases, and discussed their practical applications. Furthermore, we provided an overview of the main applications of DEG and conducted comprehensive analyses based on its latest version. However, it should be noted that the essential gene is a dynamic concept instead of a binary one, which presents both opportunities and challenges for their future development.

