Random Forests for Quality Control in G-Protein Coupled Receptor Databases

Shkurin, Aleksei; Vellido, Alfredo

doi:10.1007/978-3-319-31744-1_61

Bioinformatics and Biomedical Engineering

2016



Cited by 1 publication

References 14 publications


Using random forests for assistance in the curation of G-protein coupled receptor databases

Shkurin

Vellido

2017

BioMed Eng OnLine

Self Cite


Background: Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences.

Methods: We are interested not as much in achieving maximum sub-family discrimination accuracy using quantitative methods, but in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers.

Results: Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each subfamily and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task.

Conclusion: The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.

Electronic supplementary material: The online version of this article (doi:10.1186/s12938-017-0357-4) contains supplementary material, which is available to authorized users.
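The abstract defines misclassification consistency through the votes of the forest's base trees but does not give a formula. The Python sketch below is one plausible reading, not the authors' procedure: for each sequence it takes the per-tree votes, measures how often the trees are wrong, and measures how concentrated the wrong votes are on a single sub-family. The function names, the labels "A", "B", "C", and both thresholds are hypothetical choices made for illustration.

```python
from collections import Counter


def misclassification_consistency(tree_votes, true_label):
    """Summarise how one sequence is misclassified by the trees of a forest.

    tree_votes: list with one predicted label per base tree.
    Returns (most_voted_wrong_label, error_rate, consistency), where
    error_rate is the fraction of trees voting for any wrong label and
    consistency is the share of those wrong votes that go to the single
    most common wrong label. Both definitions are assumptions.
    """
    wrong_votes = [vote for vote in tree_votes if vote != true_label]
    if not wrong_votes:
        return None, 0.0, 0.0
    label, count = Counter(wrong_votes).most_common(1)[0]
    error_rate = len(wrong_votes) / len(tree_votes)
    consistency = count / len(wrong_votes)
    return label, error_rate, consistency


def shortlist(votes_by_sequence, true_labels, min_error=0.75, min_consistency=0.8):
    """List sequences that are often misclassified, mostly to one wrong label.

    votes_by_sequence: dict mapping a sequence id to its per-tree votes.
    true_labels: dict mapping a sequence id to its database label.
    The thresholds are illustrative defaults, not values from the paper.
    """
    flagged = []
    for seq_id, votes in votes_by_sequence.items():
        label, error_rate, consistency = misclassification_consistency(
            votes, true_labels[seq_id]
        )
        if label is not None and error_rate >= min_error and consistency >= min_consistency:
            flagged.append((seq_id, label))
    return flagged


# Toy votes from a 10-tree forest; every sequence is labelled "A" in the database.
votes = {
    "seq1": ["B"] * 8 + ["A"] + ["C"],      # often wrong, almost always "B"
    "seq2": ["A"] * 10,                     # always correct
    "seq3": ["B"] * 4 + ["C"] * 4 + ["A"] * 2,  # often wrong, but inconsistently
}
labels = {"seq1": "A", "seq2": "A", "seq3": "A"}
flagged = shortlist(votes, labels)
```

With these toy votes only `seq1` is flagged, which matches the kind of case the abstract proposes referring to curators. In practice the per-tree votes would come from a trained random forest, ideally using only trees that did not see the sequence during training.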


