2017
DOI: 10.1186/s12938-017-0357-4
Using random forests for assistance in the curation of G-protein coupled receptor databases

Abstract: Background: Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite…

Cited by 8 publications (8 citation statements)
References 22 publications
“…Multivariate models were developed with candidate variables that were significant in the univariate analysis using logistic regression and random forest analysis, and their predictive capabilities for PTS were compared. The logistic regression model is binary whereas the random forest creates multiple training sets for decision trees, wherein each tree is built based on a bootstrap sample drawn randomly from the original dataset using the CART method and the Decrease Gini Impurity as the splitting criterion (12). Furthermore, at each branching, only a given number of randomly selected features were considered as candidates.…”
Section: Model Development
confidence: 99%
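The statement above names the three mechanics of the random forest it describes: a bootstrap sample per tree, the Decrease Gini Impurity as the splitting criterion, and a random subset of candidate features at each branching. As a minimal sketch of those three steps only (not the implementation used in the cited work; all function names are hypothetical):

```python
import random
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Decrease in Gini impurity achieved by splitting `parent` into `left`/`right`.

    CART-style tree induction picks the candidate split maximising this value.
    """
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the training set for one tree."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(n_features, m, rng):
    """At each branching, only m randomly selected features are considered."""
    return rng.sample(range(n_features), m)
```

For example, splitting a perfectly mixed two-class node into two pure children yields a Gini decrease of 0.5, the maximum for a binary problem.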
“…Work on the 2011 version of the database provided evidence of clearly defined limits to the separability of the different class C subtypes. This evidence was produced using both supervised 25,26 and semi-supervised 22 machine learning approaches and from different data transformation strategies. Interestingly, the subtypes shown to be most responsible for such lack of complete subtype separability were precisely those which were removed in the 2016 versions of the databases (namely vomeronasal, odorant and pheromone receptors).…”
Section: Data
confidence: 99%
“…Subsequent work reported in 26, which again employed alignment-free data transformations, used a Random Forest (RF) classifier 36 to further investigate the consistency of misclassification in this problem. Note that RF is an ensemble learning technique 37 with an internal classification voting system that is naturally suited to classification consistency analyses.…”
Section: Data
confidence: 99%
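The internal voting system mentioned above is what makes RF usable for consistency analysis: the fraction of trees agreeing on the winning class is a per-item confidence score, and items the forest persistently splits over are candidates for curator review. A minimal sketch of that vote aggregation, assuming per-tree labels are already available (the helper and the subtype labels are illustrative, not from the paper):

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Majority vote across an ensemble of trees.

    Returns the winning class label and the fraction of trees that voted
    for it -- a natural classification-consistency score for the item.
    """
    counts = Counter(tree_predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(tree_predictions)

# A sequence most trees agree on, vs. one the forest is split over.
confident = forest_vote(["mGluR"] * 9 + ["CaSR"])      # high vote fraction
ambiguous = forest_vote(["mGluR"] * 6 + ["CaSR"] * 5)  # near 50/50 split
```

Sorting items by ascending vote fraction then surfaces the least consistently classified receptors first, which is the curation-assistance angle of the titled paper.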
“…Although assessing how complete or correct a large data set may be remains a challenge, examples have been reported. These include computational methods for identifying data updates and artifacts that may be of interest to downstream data consumers [1], machine learning methods to identify incorrectly classified G-protein coupled receptors [2], and methods to improve the quality of large data sets prior to quantitative structure-activity relationship modeling [3]. The completeness and quality of curated nanomaterial data has also been explored [4].…”
Section: Introduction
confidence: 99%