In recent years, similarity searching has gained wide popularity as a method for performing ligand-based virtual screening (LBVS). This screening technique works by comparing the features of the target compound with those of each compound in the database. It is well known that no individual similarity measure gives the best performance in every case, that is, for the active compound structures of all types of activity classes. Several techniques and strategies have been proposed in the literature to improve the overall effectiveness of ligand-based virtual screening approaches.
In this paper, a genetic algorithm-based feature selection approach is put forward to improve similarity searching for ligand-based virtual screening. We demonstrate how genetic algorithms can be applied to optimise the performance of the screening process by choosing the most relevant features. Three different benchmark datasets taken from the MDL Drug Data Report (MDDR) database are employed to examine and assess the performance of our proposed approach. The obtained results demonstrate superior performance compared with that obtained with the Tanimoto coefficient, which is considered the best-performing coefficient in the LBVS domain.
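To make the idea of genetic algorithm-based feature selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary fingerprints, a bit mask as the chromosome, Tanimoto similarity computed only over the selected bits, and a fitness equal to the number of known actives retrieved in the top-ranked compounds. The function names (`tanimoto`, `fitness`, `select_features`), the genetic operators, and all parameter values are illustrative assumptions.

```python
import random


def tanimoto(a, b, mask):
    """Tanimoto coefficient of two binary fingerprints over the bits where mask is 1."""
    both = sum(1 for x, y, m in zip(a, b, mask) if m and x and y)
    either = sum(1 for x, y, m in zip(a, b, mask) if m and (x or y))
    return both / either if either else 0.0


def fitness(mask, query, database, labels, top_k):
    """Count known actives (label 1) among the top_k compounds ranked by masked similarity."""
    order = sorted(range(len(database)),
                   key=lambda i: tanimoto(query, database[i], mask),
                   reverse=True)
    return sum(labels[i] for i in order[:top_k])


def select_features(query, database, labels, top_k=5, pop_size=20,
                    generations=30, mutation_rate=0.05, seed=0):
    """Evolve a bit mask (1 = keep the feature) that maximises the fitness above."""
    rng = random.Random(seed)
    n_bits = len(query)

    def score(mask):
        return fitness(mask, query, database, labels, top_k)

    # Start from the all-ones mask (plain Tanimoto) plus random masks.
    population = [[1] * n_bits]
    population += [[rng.randint(0, 1) for _ in range(n_bits)]
                   for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[:pop_size // 2]  # truncation selection, best masks survive
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)    # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children

    return max(population, key=score)
```

Because the all-ones mask is part of the initial population and the best masks always survive, the returned mask never scores lower than unmasked Tanimoto on the data used for fitness. A real study would evaluate the selected mask on held-out queries to check that the gain generalises.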
Background:
Metagenomics is the study of the bulk genomic content of an environment of interest, such as the human gut or soil. Taxonomy, the science of defining and naming groups of microbial organisms that share the same characteristics, is one of the most important fields of metagenomics. The taxonomic classification problem is the identification and quantification of microbial species or higher-level taxa sampled by high-throughput sequencing.
Objective:
Although many methods exist to deal with the taxonomic classification problem, assignment to low taxonomic ranks remains an important challenge for binning methods, as does scalability to Gb-sized datasets generated with deep sequencing techniques.
Methods:
In this paper, we introduce NLP-MeTaxa, a novel composition-based method for taxonomic binning that relies on word embeddings and a deep learning architecture. The proposed approach is word-based: metagenomic DNA fragments are processed as a set of overlapping words, which the word2vec model vectorizes in order to feed the deep learning model. The NLP-MeTaxa output is visualized as an NCBI taxonomy tree; this representation helps to show the connection between the predicted taxonomic identifiers. NLP-MeTaxa was trained on large-scale data from NCBI RefSeq, comprising more than 14,000 complete microbial genomes. The NLP-MeTaxa code is available at https://github.com/padriba/NLP_MeTaxa/
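The word-based preprocessing described above can be sketched as follows. This is a hedged illustration only, not code from the NLP-MeTaxa repository: the word length `k`, the stride, the handling of ambiguous bases, and the helper names `kmer_words` and `build_vocabulary` are all assumptions. The resulting lists of words are the kind of "sentences" a word2vec model would then be trained on.

```python
def kmer_words(fragment, k=6, stride=1):
    """Split a DNA fragment into overlapping words (k-mers) of length k."""
    seq = fragment.upper()
    words = []
    for start in range(0, len(seq) - k + 1, stride):
        word = seq[start:start + k]
        # Skip words containing ambiguous bases such as N.
        if set(word) <= set("ACGT"):
            words.append(word)
    return words


def build_vocabulary(sentences):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


fragments = ["ACGTACGTAC", "TTGACGTACA"]
sentences = [kmer_words(f, k=4) for f in fragments]
vocab = build_vocabulary(sentences)
```

With a stride of 1, consecutive words share `k - 1` bases, which is what makes them "overlapping"; a larger stride would reduce the overlap and shorten each sentence.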
Results:
We evaluated NLP-MeTaxa on a real and a simulated metagenomic dataset and compared our results with those of other tools. The experimental results show that our method outperforms the other methods, especially for classification at low taxonomic ranks such as species and genus.
Conclusion:
In summary, our new method might provide novel insight into the microbial community through the identification of the organisms it contains.