Cancer is a genetic disease that develops through a series of somatic mutations, a subset of which drive cancer progression. Although cancer genome sequencing studies are beginning to reveal the mutational patterns of genes in various cancers, identifying the small subset of “causative” mutations from the large subset of “non-causative” mutations, which accumulate as a consequence of the disease, is a challenge. In this article, we present an effective machine learning approach for identifying cancer-associated mutations in human protein kinases, a class of signaling proteins known to be frequently mutated in human cancers. We evaluate the performance of 11 well-known supervised learners and show that a multiple-classifier approach, which combines the outputs of individual learners, significantly improves the classification of known cancer-associated mutations. We introduce several novel features related specifically to structural and functional characteristics of protein kinases and find that the level of conservation of the mutated residue at specific evolutionary depths is an important predictor of oncogenic effect. We combine the novel features and the multiple-classifier approach to prioritize and experimentally test a set of rare unconfirmed mutations in the epidermal growth factor receptor tyrosine kinase (EGFR). Our studies identify T725M and L861R as rare cancer-associated mutations, inasmuch as these mutations increase EGFR activity in the absence of the activating EGF ligand in cell-based assays.
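As a rough illustration of what a multiple-classifier approach can look like, the sketch below averages the scores of several learners for one mutation and thresholds the result. The abstract does not say how the individual learners are combined, so the averaging rule, the function name `combine_predictions`, the 0.5 threshold, and the example scores are all assumptions made for illustration.

```python
from statistics import mean


def combine_predictions(probabilities, threshold=0.5):
    """Average per-learner scores and threshold the result.

    `probabilities` holds one score per learner, each the estimated
    probability that a mutation is cancer-associated. Simple averaging
    is an assumed combination rule, not the one used in the article.
    """
    if not probabilities:
        raise ValueError("need at least one learner output")
    score = mean(probabilities)
    return score, score >= threshold


# Hypothetical scores from three learners for a single mutation.
score, is_associated = combine_predictions([0.9, 0.6, 0.3])
```

Averaging lets a confident learner outweigh a hesitant one, whereas a hard majority vote would count each learner equally; either rule would fit the description in the abstract.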
Malware clustering is commonly applied by malware analysts to cope with the ever-growing number of distinct malware variants collected every day from the Internet. While malware clustering systems can be useful for a variety of applications, assessing the quality of their results is intrinsically hard. In fact, clustering can be viewed as an unsupervised learning process over a dataset for which the complete ground truth is usually not available. Previous studies propose to evaluate malware clustering results by leveraging the labels assigned to the malware samples by multiple anti-virus scanners (AVs). However, the methods proposed thus far require a (semi-)manual adjustment and mapping between labels generated by different AVs, and are limited to selecting a reference subset of samples for which an agreement regarding their labels can be reached across a majority of AVs. This approach may bias the reference set towards "easy to cluster" malware samples, thus potentially resulting in an overoptimistic estimate of the accuracy of the malware clustering results. In this paper we propose VAMO, a system that provides a fully automated quantitative analysis of the validity of malware clustering results. Unlike previous work, VAMO does not seek a majority voting-based consensus across different AV labels, and does not discard the malware samples for which such a consensus cannot be reached. Rather, VAMO explicitly deals with the inconsistencies typical of multiple AV labels to build a more representative reference set than majority voting-based approaches produce. Furthermore, VAMO avoids the need for the (semi-)manual mapping between AV labels from different scanners that was required in previous work.
Through an extensive evaluation in a controlled setting and a real-world application, we show that VAMO outperforms majority voting-based approaches, and provides a better way for malware analysts to automatically assess the quality of their malware clustering results.
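To make the limitation of the earlier majority voting-based approaches concrete, the following minimal sketch builds a reference set by keeping only samples on which a strict majority of AVs agree. It illustrates the baseline the abstract criticizes, not VAMO itself. The function name, the sample identifiers, and the family labels are hypothetical, and the labels are assumed to have already been mapped to a common naming scheme.

```python
from collections import Counter


def majority_reference_set(av_labels):
    """Split samples into those with a strict-majority label and the rest.

    `av_labels` maps a sample id to the list of family labels assigned by
    different AV scanners (assumed already normalized to shared names).
    """
    kept, discarded = {}, []
    for sample, labels in av_labels.items():
        if not labels:
            discarded.append(sample)
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count * 2 > len(labels):  # strict majority of scanners agree
            kept[sample] = label
        else:
            discarded.append(sample)
    return kept, discarded


# Hypothetical labels from three scanners for two samples.
av_labels = {
    "sample_a": ["zbot", "zbot", "zbot"],
    "sample_b": ["zbot", "sality", "virut"],
}
kept, discarded = majority_reference_set(av_labels)
```

Here `sample_b` is dropped because no label reaches a majority, which is the kind of hard-to-label sample whose exclusion can bias the reference set towards "easy to cluster" malware.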
Metagenomics is the study of microorganisms, such as prokaryotes, found in samples from natural environments. Such samples may contain DNA from many different species of microorganisms, including bacteria and archaea. Microorganisms are responsible for most of the symbiotic activity on earth. They are also responsible for the complex chemical reactions that take place on the surface of the earth and help maintain its ecological balance. With the increase in genome sequencing projects, there has been a considerable increase in the amount of assembled sequencing data. In this article, we apply supervised learners, namely decision trees, Bayesian networks, and decision tables, to see how classification performance degrades as the number of species present in the metagenomic sample increases. We also examine how performance changes as the percentage of unknown sequences in the metagenomic sample is varied.