Background:
Molecular biomarkers offer new ways to understand many disease
processes. Noncoding RNAs (ncRNAs) play a crucial role in several cellular activities that
are highly correlated with many human diseases, especially cancer. The classification and
identification of ncRNAs have therefore become a critical issue because of their application as
biomarkers in many human diseases.
Objective:
Most existing computational tools for ncRNA classification handle only one type of ncRNA.
They are based on structural information or specific known
features. Furthermore, these tools suffer from a lack of significant and validated features,
so their performance is not always satisfactory.
Methods:
We propose a novel approach named imCnC for ncRNA classification based on
multisource deep learning, which integrates several data sources, such as genomic and epigenomic
data, to identify several ncRNA types. We also propose an optimization technique that visualizes the
feature patterns extracted by the multisource CNN model in order to measure the epigenomic features
of each ncRNA type.
Results:
Computational results on a dataset of 16 human ncRNA classes downloaded from
RFAM show that imCnC outperforms the existing tools, achieving an accuracy of
94.18%. In addition, our method enables the discovery of new ncRNA features by using an optimization
technique to measure and visualize the feature patterns of the imCnC classifier.
Background:
Metagenomics is the study of genomic content sampled in bulk from an environment of interest, such as the human gut or soil. Taxonomy, the science of defining and naming groups of microbial organisms that share the same characteristics, is one of the most important fields of metagenomics. The taxonomic classification problem is the identification and quantification of microbial species or higher-level taxa sampled by high-throughput sequencing.
Objective:
Although many methods exist to deal with the taxonomic classification problem, assignment to low taxonomic ranks remains an important challenge for binning methods, as does scalability to Gb-sized datasets generated with deep sequencing techniques.
Methods:
In this paper, we introduce NLP-MeTaxa, a novel composition-based method for taxonomic binning that relies on word embeddings and a deep learning architecture. The proposed approach is word-based: each metagenomic DNA fragment is processed as a set of overlapping words, which are vectorized with the word2vec model and then fed to the deep learning model. NLP-MeTaxa output is visualized as an NCBI taxonomy tree, a representation that helps to show the connections between the predicted taxonomic identifiers. NLP-MeTaxa was trained on large-scale data from NCBI RefSeq, comprising more than 14,000 complete microbial genomes. The NLP-MeTaxa code is available at https://github.com/padriba/NLP_MeTaxa/
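As a rough illustration of the word-based view described above, the following minimal sketch splits a DNA fragment into overlapping words (k-mers) with a stride of one. It covers only the tokenization step, not the word2vec vectorization or the deep learning model. The function name `overlapping_kmers` and the word length `k` are assumptions made for illustration; the abstract does not state the word length NLP-MeTaxa uses.

```python
from typing import List


def overlapping_kmers(fragment: str, k: int = 6) -> List[str]:
    """Split a DNA fragment into overlapping words of length k (stride 1).

    Hypothetical helper: k = 6 is an illustrative default, not a value
    taken from the NLP-MeTaxa abstract.
    """
    fragment = fragment.upper()
    # A fragment shorter than k yields no words (empty range).
    return [fragment[i:i + k] for i in range(len(fragment) - k + 1)]


if __name__ == "__main__":
    words = overlapping_kmers("ATGCGTAC", k=4)
    print(words)  # ['ATGC', 'TGCG', 'GCGT', 'CGTA', 'GTAC']
```

Each resulting list of words plays the role of a "sentence" that an embedding model such as word2vec could then map to vectors.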
Results:
We evaluated NLP-MeTaxa on real and simulated metagenomic data and compared our results with those of other tools. The experimental results show that our method outperforms the other methods, especially for classification at low taxonomic ranks such as species and genus.
Conclusion:
In summary, our new method might provide novel insight into microbial communities by identifying the organisms they may contain.
The quantitative structure-activity relationship (QSAR) approach is one of the most commonly used methods for predicting biological properties to aid the drug discovery process. It is an adequate alternative to expensive and time-consuming ecotoxicological experiments. Since the mid-1960s, the QSAR paradigm ('similar compounds have similar activities') has remained the foundation of all QSAR models developed so far (Heppner, 1988).