Accurate gene prediction in metagenomics fragments is a computationally challenging task due to the short read length and the incomplete, fragmented nature of the data. Most gene-prediction programs are based on extracting a large number of features and then applying statistical approaches or supervised classification approaches to predict genes. In our study, we introduce a convolutional neural network for metagenomics gene prediction (CNN-MGP) program that predicts genes in metagenomics fragments directly from raw DNA sequences, without the need for manual feature extraction and feature selection stages. CNN-MGP is able to learn the characteristics of coding and non-coding regions and to distinguish coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets based on pre-defined GC content ranges. We extract ORFs from each fragment; the ORFs are then encoded numerically and fed into the appropriate CNN model, chosen according to the GC content of the fragment. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm is used to select the final gene list. Overall, CNN-MGP is effective and achieves 91% accuracy on the testing dataset. CNN-MGP shows the ability of deep learning to predict genes in metagenomics fragments, and it achieves an accuracy higher than or comparable to state-of-the-art gene-prediction programs that use pre-defined features.
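The abstract above names two steps without giving details: numeric encoding of each ORF and a greedy selection of the final gene list. The following is a minimal sketch of what such steps could look like, not the CNN-MGP implementation. One-hot encoding is assumed as the numeric encoding, and the selection rule (highest probability first, rejecting candidates that overlap an already accepted ORF by more than a threshold) is likewise an assumption. The names `one_hot_encode`, `ScoredOrf`, `greedy_select`, `min_prob`, and `max_overlap` are hypothetical.

```python
from typing import List, NamedTuple

BASES = "ACGT"


def one_hot_encode(orf: str) -> List[List[int]]:
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Assumption: ambiguous bases such as N map to an all-zero row.
    """
    rows = []
    for base in orf.upper():
        row = [0, 0, 0, 0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


class ScoredOrf(NamedTuple):
    start: int    # 0-based, inclusive position in the fragment
    end: int      # 0-based, exclusive
    prob: float   # coding probability assigned by a classifier


def overlap(a: ScoredOrf, b: ScoredOrf) -> int:
    """Number of positions shared by two ORFs (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def greedy_select(orfs: List[ScoredOrf],
                  min_prob: float = 0.5,
                  max_overlap: int = 60) -> List[ScoredOrf]:
    """Accept ORFs in order of decreasing probability.

    A candidate is kept only if its probability is at least min_prob and it
    overlaps every already accepted ORF by at most max_overlap positions.
    Both thresholds are illustrative values, not taken from the paper.
    """
    selected: List[ScoredOrf] = []
    for cand in sorted(orfs, key=lambda o: o.prob, reverse=True):
        if cand.prob < min_prob:
            break  # all remaining candidates score lower still
        if all(overlap(cand, kept) <= max_overlap for kept in selected):
            selected.append(cand)
    return sorted(selected, key=lambda o: o.start)


if __name__ == "__main__":
    candidates = [
        ScoredOrf(0, 300, 0.9),
        ScoredOrf(100, 400, 0.8),   # overlaps the first ORF heavily
        ScoredOrf(350, 600, 0.7),
        ScoredOrf(700, 900, 0.3),   # below the probability threshold
    ]
    for orf in greedy_select(candidates):
        print(orf)
```

In this sketch the greedy pass is quadratic in the number of candidate ORFs, which should be acceptable for short fragments that yield only a handful of candidates each.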
Background: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome. Results: In this paper we introduce a metagenomics gene caller (MGC), which is an improvement over the state-of-the-art prediction algorithm Orphelia. Orphelia uses a two-stage machine learning approach and computes a model that classifies extracted ORFs from fragmented sequences. We hypothesise and demonstrate evidence that sequences need separate models based on their local GC-content in order to avoid the noise introduced to a single model computed with sequences from the entire GC spectrum. We have also added two amino-acid features based on the benefit of amino-acid usage shown in our previous research. Our algorithm is able to predict genes and translation initiation sites (TIS) more accurately than Orphelia, which uses a single model. Conclusions: Learning separate models for several pre-defined GC-content regions, as opposed to a single-model approach, improves the performance of the neural network, as demonstrated by the experimental results presented in this paper. The inclusion of amino-acid usage features also helps improve the overall accuracy of our algorithm. MGC's improvement sets the ground for further investigation into the use of GC-content to separate data for training models in machine learning based gene finders.
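The idea of routing each sequence to a model trained for its GC-content range can be illustrated with a short sketch. The abstract does not state the range boundaries, so the values in `GC_UPPER_BOUNDS` below are placeholders, and the function names `gc_content` and `model_index` are hypothetical rather than part of MGC.

```python
from bisect import bisect_left
from typing import Sequence

# Hypothetical upper bounds of the GC-content ranges, one per model.
# The real pre-defined ranges are not given in the abstract.
GC_UPPER_BOUNDS = (0.35, 0.45, 0.55, 1.0)


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases of seq.

    Assumption: bases other than A, C, G, T are ignored, and a sequence
    with no unambiguous bases has GC content 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def model_index(fragment: str,
                upper_bounds: Sequence[float] = GC_UPPER_BOUNDS) -> int:
    """Index of the first range whose upper bound is >= the fragment's GC content."""
    return bisect_left(upper_bounds, gc_content(fragment))


if __name__ == "__main__":
    for frag in ("ATATATAT", "ATGC", "GGCC"):
        print(frag, gc_content(frag), model_index(frag))
```

With sorted upper bounds, `bisect_left` keeps the lookup logarithmic in the number of ranges, although for a handful of models a plain loop would do just as well.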
Soil salinity is a significant abiotic stress that severely limits global crop production. Chickpea (Cicer arietinum L.) is an important grain legume that plays a substantial role in nutritional food security, especially in the developing world. This study used a chickpea population collected from the International Center for Agricultural Research in the Dry Areas (ICARDA) genebank using the focused identification of germplasm strategy. The germplasm included 186 genotypes with broad Asian and African origins and was genotyped with 1856 DArTseq markers. We conducted phenotyping for salinity in the field (Arish, Sinai, Egypt) and in greenhouse hydroponic experiments at 100 mM NaCl concentration. Based on the performance in both hydroponic and field experiments, we identified seven genotypes from Azerbaijan and Pakistan (IGs: 70782, 70430, 70764, 117703, 6057, 8447, and 70249) as potential sources of high salinity tolerance. Multi-trait genome-wide association analysis (mtGWAS) detected one locus on chromosome Ca4 at 10618070 bp associated with salinity tolerance under hydroponic and field conditions. In addition, we located another locus specific to the hydroponic system on chromosome Ca2 at 30537619 bp. Gene annotation analysis revealed the location of rs5825813 within the Embryogenesis-associated protein (EMB8-like), while the location of rs5825939 is within the Ribosomal Protein Large P0 (RPLP0). Utilizing such markers in practical breeding programs can effectively improve the adaptability of current chickpea cultivars in saline soil. Moreover, researchers can use our markers to facilitate the incorporation of new genes into commercial cultivars.