Patterns of genetic diversity within species contain information about the history of that species, including how it has responded to historical climate change and how easily the organism is able to disperse across its habitat. More than 40,000 phylogeographic and population genetic investigations have been published to date, each collecting genetic data from hundreds of samples. Despite these millions of data points, meta‐analyses are challenging because the synthesis of results across hundreds of studies, each using different methods and forms of analysis, is a daunting and time‐consuming task. It is more efficient to proceed by repurposing existing data and using automated data analysis. To facilitate data repurposing, we created a database (phylogatR) that aggregates data from different sources and conducts automated multiple sequence alignments and data curation to provide users with nearly ready‐to‐analyse sets of data for thousands of species. Two types of scientific research will be made easier by phylogatR: large meta‐analyses of thousands of species that can address classic questions in evolutionary biology and ecology, and student‐ or citizen‐science‐based investigations that will introduce a broad range of people to the analysis of genetic data. phylogatR enhances the value of existing data via the creation of software and web‐based tools that enable these data to be recycled and reanalysed and that increase accessibility to big data for research laboratories and classroom instructors with limited computational expertise and resources.
Background: Long thought to be “relics” of evolution, pseudogenes have only recently attracted medical interest for their regulatory roles in cancer. Often, these regulatory roles are a direct by-product of their close sequence homology to protein-coding genes. Novel pseudogene-gene (PGG) functional associations can be identified through the integration of biomedical data, such as sequence homology, functional pathways, gene expression, pseudogene expression, and microRNA expression. However, these data types have not all been integrated, and almost all previous pseudogene studies relied on 1:1 pseudogene–parent gene relationships without leveraging other homologous genes/pseudogenes.
Results: We produce PGG families that expand beyond the current 1:1 paradigm. First, we construct expansive PGG databases by (i) CUDAlign graphics processing unit (GPU)-accelerated local alignment of all pseudogenes to gene families (totaling 1.6 billion individual local alignments and >40,000 GPU hours) and (ii) BLAST-based assignment of pseudogenes to gene families. Second, we create an open-source web application (PseudoFuN [Pseudogene Functional Networks]) to search for integrative functional relationships of sequence homology, microRNA expression, gene expression, pseudogene expression, and gene ontology. We produce four “flavors” of CUDAlign-based databases (>462,000,000 PGG pairwise alignments and 133,770 PGG families) that can be queried and downloaded using PseudoFuN. These databases are consistent with previous 1:1 PGG annotation and are also much more powerful, including millions of de novo PGG associations. For example, we find multiple known (e.g., miR-20a-PTEN-PTENP1) and novel (e.g., miR-375-SOX15-PPP4R1L) microRNA-gene-pseudogene associations in prostate cancer. PseudoFuN provides a “one stop shop” for identifying and visualizing thousands of potential regulatory relationships related to pseudogenes in The Cancer Genome Atlas cancers.
Conclusions: With a simple-to-use online tool, bioinformaticians and oncologists alike can explore thousands of new PGG associations in the context of microRNA-gene-pseudogene co-expression and differential expression.