ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest

Li, Jiajin; Jew, Brandon; Zhan, Lingyu; Hwang, Sun‐Goo; Coppola, Giovanni; Sul, Jae Hoon

doi:10.1371/journal.pcbi.1007556

Cited by 23 publications

(23 citation statements)

References 55 publications


“…In recent years, machine learning has been used for sorting these variants. Therefore, we tested the following eight algorithms that have used filtering small variants: logistic regression (LR) 44 , decision tree (DT) 45 , k-nearest neighbor (kNN) 44 , 45 , random forest (RF) 44 , 45 , linear discriminant analysis (LDA) 45 , naïve Bayes (NB) 44 , 45 , and support vector machine (SVM) 44 . We then used these scores to determine if the target candidate was a true or false positive.…”

Section: Results (mentioning)

confidence: 99%

“…We further investigated whether machine learning was applicable to accurately detect stable gene mutations and rare structural variants. Machine learning techniques have already been used to efficiently detect small mutations 44 , 45 , 51 . In this case, we used non-deep learning methods because we started with a small data set, and we found that k-NN clustering, naïve Bayes, and linear support vector machine algorithms showed a correct identification rate of 95% or more.…”

Section: Discussion (mentioning)

confidence: 99%


Efficient collection of a large number of mutations by mutagenesis of DNA damage response defective animals

Suehiro, Yoshina, Motohashi, et al. 2021. Sci Rep


With the development of massive parallel sequencing technology, it has become easier to establish new model organisms that are ideally suited to the specific biological phenomena of interest. Considering the history of research using classical model organisms, we believe that the efficient construction and sharing of gene mutation libraries will facilitate the progress of studies using these new model organisms. Using C. elegans, we applied the TMP/UV mutagenesis method to animals lacking function in the DNA damage response genes atm-1 and xpc-1. This method produces genetic mutations three times more efficiently than mutagenesis of wild-type animals. Furthermore, we confirmed that the use of next-generation sequencing and the elimination of false positives through machine learning could automate the process of mutation identification with an accuracy of over 95%. Eventually, we sequenced the whole genomes of 488 strains and isolated 981 novel mutations generated by the present method; these strains have been made available to anyone who wants to use them. Since the targeted DNA damage response genes are well conserved and the mutagens used in this study are also effective in a variety of species, we believe that our method is generally applicable to a wide range of animal species.


“…The model was developed in R language (R version 3.6.0, RandomForest version 4.6-14, NeuralNet version 1.44.2) which are Machine Learning open source modules and hence more users can tune this model on their own data by using our training database. ML and more particularly RF model has already been used in several publications as a tool for risk evaluation, quality control or predictions based on sequencing data 2,4,9 . Furthermore, the applications we found using RF rely on the analysis of VCF (Variant Caller) files such as Smurf which predicts a consensus set of somatic mutation calls or Octopus which uses BAM files as training dataset to classify variant calls 9,10 .…”

Section: Discussion (mentioning)

confidence: 99%

Machine Learning Random Forest for predicting onco-somatic variants NGS analysis

Pellegrino, Jacques, Beaufils, et al. 2021. Preprint


Motivation: Since 2017, we are using IonTorrent NGS platform in our hospital in order to diagnose cancer and treatment. Analysis variants at each run take us a longtime and we are still struggling with some variants which look correct on the first look at their metrics but found to be negative when we look further into them. Can any Machine Learning algorithm help us to classify NGS variant calling? This has determined us to investigate which ML could fit to our NGS data and to develop a tool which can be implemented in Routine in order to help Biologists.

Introduction: Nowadays, one of medicine challenges is processing a significant amount of data. It’s particularly true in molecular biology with the advantage of Next Generation Sequencing (NGS) for molecular tumor profile determination and treatment selection. In addition to bioinformatics pipelines, Artificial Intelligence (AI) can offer a very valuable help in analyzing. Generating sequencing data from patient DNA samples has become easy to perform in clinical trials. But analyzing the huge amount of genomic or transcriptomic data and extracting the key biomarkers associated with a clinical response to a specific therapy requires a formidable combination of scientific expertise, biomolecular skill and a panel of bioinformatics and biostatistics tools, in which artificial intelligence is now a success factor in developing future routine diagnostics. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. We present a Machine Learning method to classify pathogenic Single Nucleotide Variants (SNVs), SNP (Single Nucleotide Polymorphism), MNVs (Multiple Nucleotide Variants), Insertion, Deletion detected by NGS from tumors specimens for Colorectal, Melanoma, Lung and Glioma cancer.

Methods: We compared our NGS data to different machine learning algorithms using the 10-fold cross validation method and to neural networks (Deep Learning) in order to measure the performance of the different ML algorithms and determine which one is a valid model for confirming NGS variant calls in cancer diagnostic. We trained our Machine Learning with 70 % of our data samples, extracted from our local database (our data structure had 7 parameters: chromosome, position, exon, variant allele frequency, minor allele frequency, coverage and protein description) and validated it with 30 % remaining. The model offering the best accuracy was chosen and implemented in NGS analysis routine. The artificial intelligence was developed with R script language version 3.6.0.

Results: We trained our model on 102011 variants. Our best error rate (0.22%) was found with Random Forest Machine Learning (ntree=500 and mtry=4) with an AUC of 0.99. Neural Networks achieved some good scores. The final trained model with Neural Network was able to achieve an accuracy of 98% and a ROC-AUC of 0.99 with validation data. We tested our RF model to interpret more than 2000 variants from our NGS database: 20 variants were misclassified (error rate <1%). The errors were nomenclature problems and false positive. After adding false positive in our training database and implementing our RF model in routine, our error rate was always < 0.5%.

Conclusion: Our RF model shows excellent results for onco-somatic NGS interpretation and it could easily be implemented in other molecular biology laboratories. AI is taking an increasingly important place in molecular biomedical analysis and could be very helpful on processing of amount medical data. Neural Networks showed a good capacity in the classification of variants and in the future may be useful in the prediction of more complex variants.
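The evaluation scheme described in the Methods (a 70 %/30 % train/validation split plus 10-fold cross validation) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' code, which was written in R; the function names `holdout_split`, `k_fold_indices`, and `error_rate`, the fixed seed, and the choice to run the folds inside the 70 % training portion are all assumptions made for illustration, and no classifier is trained here.

```python
import random


def holdout_split(n_items, train_fraction=0.7, seed=0):
    """Shuffle indices 0..n_items-1 and split them into train/validation lists.

    Hypothetical helper: the abstract states a 70/30 split but not how it was drawn.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_items * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=10):
    """Yield (train, test) index lists so each index is in exactly one test fold."""
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def error_rate(n_misclassified, n_total):
    """Fraction of misclassified items."""
    return n_misclassified / n_total


# Toy example with 1000 placeholder variants (not the 102011 from the abstract).
train_idx, valid_idx = holdout_split(1000)
# Assumption: cross validation is run within the training portion only.
folds = list(k_fold_indices(train_idx, k=10))
```

With these helpers, `error_rate(20, 2000)` evaluates to 0.01, which is why 20 misclassifications out of "more than 2000" variants corresponds to an error rate just under 1 %.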


“…The trained model is used to either make a binary call on each variant or to assign the variant an overall quality score. The Variant Quality Score Recalibration (VQSR) method from GATK, ForestQC 19 , GATK's new CNNScoreVariants 20 , and GARFIELD-NGS 21 all use a variety of machine learning methods to aid variant filtering.…”

Section: Variant Filtering (mentioning)

confidence: 99%

Data Analysis in Rare Disease Diagnostics

Veeramachaneni 2020. J Indian Inst Sci


1 Introduction

A draft human genome covering ~95% of the human genome was first released in 2000 1. The sequence, commonly referred to as the human reference genome sequence, is a composite sequence created by sequencing and painstakingly assembling DNA obtained from anonymous volunteers of diverse backgrounds. This ~3 billion nucleotide-long genome sequence has undergone several revisions over the years and there are still small regions that have remained intractable. It is not an exaggeration to state that all clinical genomics applications today use the reference sequence as the basis for analysis.

In this article, we focus on the topic of rare disease diagnosis through sequencing. There are over 8600 rare disease phenotypes documented in OMIM today 2. The molecular basis for 6200 of these diseases has been traced to 3900 genes in the reference genome. Most rare diseases are caused by just one or two variants present in the patient genome. However, identifying the exact variants from among the more than 5 million small variants that distinguish any individual from the reference genome is an extremely challenging task 3.

There are four major steps in the rare disease diagnosis process: sequencing, variant detection, variant assessment, and variant prioritization. In this article, we take you through these steps explaining the data analysis that happens at each
