2019
DOI: 10.1371/journal.pcbi.1007556

ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest

Abstract: Next-generation sequencing technology (NGS) enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants iden…

Cited by 23 publications (23 citation statements)
References 55 publications
“…In recent years, machine learning has been used for sorting these variants. Therefore, we tested the following eight algorithms that have used filtering small variants: logistic regression (LR) 44, decision tree (DT) 45, k-nearest neighbor (kNN) 44, 45, random forest (RF) 44, 45, linear discriminant analysis (LDA) 45, naïve Bayes (NB) 44, 45, and support vector machine (SVM) 44. We then used these scores to determine if the target candidate was a true or false positive.…”
Section: Results (mentioning)
confidence: 99%
“…We further investigated whether machine learning was applicable to accurately detect stable gene mutations and rare structural variants. Machine learning techniques have already been used to efficiently detect small mutations 44, 45, 51. In this case, we used non-deep learning methods because we started with a small data set, and we found that k-NN clustering, naïve Bayes, and linear support vector machine algorithms showed a correct identification rate of 95% or more.…”
Section: Discussion (mentioning)
confidence: 99%
“…The model was developed in R language (R version 3.6.0, RandomForest version 4.6-14, NeuralNet version 1.44.2) which are Machine Learning open source modules and hence more users can tune this model on their own data by using our training database. ML and more particularly RF model has already been used in several publications as a tool for risk evaluation, quality control or predictions based on sequencing data 2,4,9. Furthermore, the applications we found using RF rely on the analysis of VCF (Variant Caller) files such as Smurf which predicts a consensus set of somatic mutation calls or Octopus which uses BAM files as training dataset to classify variant calls 9,10.…”
Section: Discussion (mentioning)
“…The trained model is used to either make a binary call on each variant or to assign the variant an overall quality score. The Variant Quality Score Recalibration (VQSR) method from GATK, ForestQC 19, GATK's new CNNScoreVariants 20, and GARFIELD-NGS 21 all use a variety of machine learning methods to aid variant filtering.…”
Section: Variant Filtering (mentioning)
confidence: 99%