Background: Recent innovations in sequencing technologies have provided researchers with the ability to rapidly characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These approaches are producing a wealth of information that is providing novel insights into the microbial ecology of the environment and human health. However, these sequencing-based approaches produce large and complex datasets that require efficient and sensitive computational analysis workflows. Many tools for analyzing metagenomic sequencing data have recently emerged; however, these approaches often suffer from limited specificity and efficiency, and typically do not include a complete metagenomic analysis framework. Results: We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps, including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data from the 2011 outbreak of Shiga-toxigenic Escherichia coli O104:H4. Conclusions: The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download at: http://sourceforge.net/projects/pathoscope/.
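The idea of quantifying per-strain read proportions when some reads align to several genomes can be illustrated with a small expectation-maximization loop. This is a minimal sketch under stated assumptions, not the PathoScope 2.0 implementation: the function name `em_reassign`, the input layout (one dict of genome to alignment likelihood per read), and the fixed iteration count are all hypothetical choices made for illustration.

```python
def em_reassign(read_likelihoods, n_iter=50):
    """Estimate genome proportions from possibly ambiguous read alignments.

    read_likelihoods: list of dicts, one per read, mapping a genome name to a
    positive alignment likelihood (hypothetical input format).
    Returns a dict mapping each genome to its estimated read proportion.
    """
    genomes = sorted({g for read in read_likelihoods for g in read})
    n_reads = len(read_likelihoods)
    # Start from uniform proportions.
    pi = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        expected = {g: 0.0 for g in genomes}
        # E-step: split each read across its candidate genomes in proportion
        # to (current genome proportion) x (alignment likelihood).
        for read in read_likelihoods:
            total = sum(pi[g] * w for g, w in read.items())
            if total == 0.0:
                continue  # read has no support under the current estimate
            for g, w in read.items():
                expected[g] += pi[g] * w / total
        # M-step: new proportions are the normalized expected read counts.
        pi = {g: expected[g] / n_reads for g in genomes}
    return pi


# Toy input: three reads unique to strain_A, one unique to strain_B,
# and two reads that align equally well to both.
reads = (
    [{"strain_A": 1.0}] * 3
    + [{"strain_B": 1.0}]
    + [{"strain_A": 1.0, "strain_B": 1.0}] * 2
)
proportions = em_reassign(reads)
```

In this toy input the ambiguous reads are pulled toward the strain that already has more uniquely aligned reads, which is the intuition behind reassignment-based quantification.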
Background: The use of sequencing technologies to investigate the microbiome of a sample can positively impact patient healthcare by providing therapeutic targets for personalized disease treatment. However, these samples contain genomic sequences from various sources that complicate the identification of pathogens. Results: Here we present Clinical PathoScope, a pipeline to rapidly and accurately remove host contamination, isolate microbial reads, and identify potential disease-causing pathogens. We have accomplished three essential tasks in the development of Clinical PathoScope. First, we developed an optimized framework for pathogen identification using a computational subtraction methodology in conjunction with read trimming and ambiguous read reassignment. Second, we have demonstrated the ability of our approach to identify multiple pathogens in a single clinical sample, accurately identify pathogens at the subspecies level, and determine the nearest phylogenetic neighbor of novel or highly mutated pathogens using real clinical sequencing data. Finally, we have shown that Clinical PathoScope outperforms previously published pathogen identification methods with regard to computational speed, sensitivity, and specificity. Conclusions: Clinical PathoScope is the only pathogen identification method currently available that can identify multiple pathogens from mixed samples and distinguish between very closely related species and strains in samples with very few reads per pathogen. Furthermore, Clinical PathoScope does not rely on genome assembly and thus can complete the analysis of a clinical sample more rapidly than current assembly-based methods. Clinical PathoScope is freely available at: http://sourceforge.net/projects/pathoscope/. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-262) contains supplementary material, which is available to authorized users.
Background: The primary objectives of this paper are (1) to apply Statistical Learning Theory (SLT), specifically Partial Least Squares (PLS) and Kernelized PLS (K-PLS), to the universal "feature-rich/case-poor" (also known as "large p small n", or "high-dimension, low-sample size") microarray problem by eliminating those features (or probes) that do not contribute to the "best" chromosome bio-markers for lung cancer, and (2) to quantitatively measure and verify (by an independent means) the efficacy of this PLS process. A secondary objective is to integrate these significant improvements in diagnostic and prognostic biomedical applications into the clinical research arena, that is, to devise a framework for converting SLT results into direct, useful clinical information for patient care or pharmaceutical research. We therefore propose and preliminarily evaluate a process whereby PLS, K-PLS, and Support Vector Machines (SVM) may be integrated with the accepted and well-understood traditional biostatistical "gold standard" methods, the Cox proportional hazards model and Kaplan-Meier survival analysis. Specifically, this new combination is illustrated with PLS and Kaplan-Meier, followed by PLS and Cox Hazard Ratios (CHR), and can be easily extended to both the K-PLS and SVM paradigms. Finally, these processes are contained in the Fine Feature Selection (FFS) component of our overall feature reduction/evaluation process, which consists of the following components: (1) coarse feature reduction, (2) fine feature selection, and (3) classification (as described in this paper) and prediction. Results: Our results for PLS and K-PLS showed that these techniques, as part of our overall feature reduction process, performed well on noisy microarray data. The best performance was an Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.794 for classification of recurrence prior to or after 36 months and an AUC of 0.869 for classification of recurrence prior to or after 60 months. Kaplan-Meier curves for the classification groups were clearly separated, with p-values below 4.5e-12 for both 36 and 60 months. CHRs were also good, with ratios of 2.846341 (36 months) and 3.996732 (60 months). Conclusions: SLT techniques such as PLS and K-PLS can effectively address difficult problems in analyzing biomedical data such as microarrays. The combinations with established biostatistical techniques demonstrated in this paper allow these methods to move from academic research into clinical practice.
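For readers unfamiliar with the AUC figures quoted above, the metric can be computed directly as the fraction of (positive, negative) pairs in which the positive case receives the higher classifier score, with ties counted as one half. The sketch below is a generic illustration, not the authors' evaluation code; the function name `roc_auc`, the label convention (1 for recurrence before the cutoff, 0 otherwise), and the toy scores are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: iterable of 0/1 class labels (assumed: 1 = recurrence before cutoff).
    scores: iterable of classifier scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data, invented purely for illustration.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of cases, which is acceptable for the small sample sizes typical of the "large p small n" setting described above.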
Cancer cell lines represent the front line of new compound testing, and results from these experiments often decide which compounds go on for further testing. Genomic context plays a critical role in drug response, and genomic data for tumors and cell lines are now widely available. However, cell lines are often chosen based on ease of access, literature prevalence, and ease of culture. We combined gene expression and CNV/mutation profiling from four pancreatic cancer tumor datasets (GSE21501, GSE28735, ICGC, and TCGA) and three pancreatic cancer cell line datasets (Klijn et al., Collisson et al., and CCLE) to identify which cell lines best match patient tumors. CNV comparison revealed that popular cell lines do not always have the best CNV correlation with tumors: when comparing pancreatic cancer tumors to cell lines, the citations of the top five cell lines by CNV correlation accounted for less than 10% of the pancreatic cancer cell line total. Next, we filtered for driver mutations, including SMAD4 and CDKN2A, using mutation scoring algorithms and clustered tumors and cell lines. We found that many cell lines with low citation counts (such as L33) clustered readily amongst tumors. Leveraging the hypothesis that different hits in the same pathway can have a similar downstream effect, we combined CNV, expression, and mutation data and clustered cell lines together with tumors based on overall aberrations in MSigDB cancer pathways. L33 and YAPC clustered near tumors, while the majority of other cell lines clustered together. To identify coexpressed gene clusters, we ran WGCNA individually in all seven datasets and discovered modules consistent in cell line and tumor datasets using iGraph. One of the most interesting modules (interferon-regulated genes) is expressed highly in the majority of tumors profiled. About half of the cell lines also express this module highly, suggesting that they may be better models for high-interferon-expression tumors than other cell lines. 
Here we present evidence demonstrating that certain cell lines mimic pancreatic tumor genomes more closely while others represent patterns of genomic features not commonly observed in vivo. We also show that certain biologically relevant tumor subtypes may be better represented by some cell lines than others. Our analysis highlights the emerging role of genomics in advancing the clinical success of therapeutic trials. Citation Format: Yoonjeong Cha, Adam Labradorf, Joseph Perez-Rogers, Brian Haas, Andrew Lysaght, Brian Weiner, Fadi Towfic, Kevin Fowler, Benjamin Zeskind, Sarah Kolitz, Badri Vardarajan, Maxim Artyomov, Rebecca L. Kusko. Leveraging transcriptomic and genomic data to better select models for preclinical oncology therapeutic development to identify cell lines most similar to patient tumors. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 789.
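The CNV comparison described in the abstract above can be sketched as ranking cell lines by how well their copy-number profiles correlate with tumors. The abstract does not state which correlation measure or tumor summary was used, so the following is only an assumed setup: Pearson correlation against the mean tumor profile, with hypothetical function names (`pearson`, `rank_cell_lines`) and invented placeholder data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def rank_cell_lines(tumor_profiles, cell_line_profiles):
    """Rank cell lines by correlation with the mean tumor CNV profile.

    tumor_profiles: list of per-tumor CNV vectors over the same genes.
    cell_line_profiles: dict mapping cell line name to its CNV vector.
    Returns (name, correlation) pairs, best match first.
    """
    n_genes = len(tumor_profiles[0])
    mean_tumor = [
        sum(t[i] for t in tumor_profiles) / len(tumor_profiles)
        for i in range(n_genes)
    ]
    scored = [
        (name, pearson(profile, mean_tumor))
        for name, profile in cell_line_profiles.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder log-ratio values over four genes (not real data).
tumors = [[1.0, 0.0, -1.0, 0.0], [1.0, 0.0, -1.0, 0.2]]
cell_lines = {
    "line_a": [0.9, 0.1, -0.8, 0.0],
    "line_b": [-1.0, 0.0, 1.0, 0.0],
}
ranking = rank_cell_lines(tumors, cell_lines)
```

A ranking of this kind could then be set against citation counts to check whether the most frequently used cell lines are also the closest matches to tumors.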