Motivation
The computational prediction of the regulatory function associated with a genomic sequence is of great importance in -omics studies, as it facilitates our understanding of the mechanisms underpinning the vast gene regulatory network. Prominent examples in this area include predicting transcription factor binding in DNA regulatory regions and predicting RNA–protein interactions in the context of post-transcriptional gene expression. However, existing computational methods suffer from high false-positive rates and have seldom used evolutionary information, despite the vast amount of orthologous data available across many extant and ancestral genomes, which presents a ready opportunity to improve their accuracy.
Results
In this study, we present a novel probabilistic approach called PhyloPGM that leverages previously trained transcription factor binding site (TFBS) or RNA–RBP binding predictors by aggregating their predictions across orthologous regions in order to boost overall prediction accuracy on human sequences. Throughout our experiments, PhyloPGM showed significant improvement over baselines such as the sequence-based RNA–RBP binding predictor RNATracker and the sequence-based TFBS predictor FactorNet. PhyloPGM is simple in principle and easy to implement, yet yields impressive results.
Availability and implementation
The PhyloPGM package is available at https://github.com/BlanchetteLab/PhyloPGM
Supplementary information
Supplementary data are available at Bioinformatics online.
Crowdsourcing through human-computing games is an increasingly popular practice for classifying and analyzing scientific data. Early contributions such as Phylo have now been running for several years. Analyzing the performance of these systems enables us to identify patterns that contributed to their success, as well as possible pitfalls. In this paper, we review the results and user statistics collected since 2010 by our platform Phylo, which aims to engage citizens in comparative genome analysis through a casual tile-matching computer game. We also identify features that help predict a task's difficulty, which is essential for channeling tasks to human players with the appropriate skill level. Finally, we show how our platform has been used to quickly improve a reference alignment of Ebola virus sequences.