Motivation: In light of the increasing adoption of targeted resequencing (TR) as a cost-effective strategy to identify disease-causing variants, a robust method for copy number variation (CNV) analysis is needed to maximize the value of this promising technology.
Results: We present a method for CNV detection for TR data, including whole-exome capture data. Our method calls copy number gains and losses for each target region based on normalized depth of coverage. Our key strategies include the use of base-level log-ratios to remove GC-content bias, correction for an imbalanced library size effect on log-ratios, and the estimation of log-ratio variations via binning and interpolation. Our methods are made available via CONTRA (COpy Number Targeted Resequencing Analysis), a software package that takes standard alignment formats (BAM/SAM) and outputs in variant call format (VCF4.0), for easy integration with other next-generation sequencing analysis packages. We assessed our methods using samples from seven different target enrichment assays, and evaluated our results using simulated data and real germline data with known CNV genotypes.
Availability and implementation: Source code and sample data are freely available under GNU license (GPLv3) at http://contra-cnv.sourceforge.net/
Contact: Jason.Li@petermac.org
Supplementary information: Supplementary data are available at Bioinformatics online.
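As a rough illustration of the depth-of-coverage idea in the abstract above, the following Python sketch computes a mean base-level log2 ratio for one target region after scaling each sample by its library size. It is not CONTRA's implementation: the function name `region_log_ratio`, the pseudo-count, and the simple library-size scaling are assumptions made for illustration, and the GC-bias handling and the binning and interpolation steps are not modeled.

```python
import numpy as np


def region_log_ratio(test_depth, control_depth,
                     test_lib_size, control_lib_size, pseudo=1.0):
    """Mean base-level log2 ratio (test vs. control) for one target region.

    Hypothetical helper for illustration only; not part of CONTRA.
    """
    test = np.asarray(test_depth, dtype=float)
    control = np.asarray(control_depth, dtype=float)

    # Scale per-base depth by library size so that samples sequenced to
    # different total depths become comparable (assumed simple scaling).
    test_norm = (test + pseudo) / test_lib_size
    control_norm = (control + pseudo) / control_lib_size

    # One log-ratio per base, then averaged over the region.
    base_log_ratios = np.log2(test_norm / control_norm)
    return float(base_log_ratios.mean())
```

Under these assumptions, a region whose scaled coverage in the test sample is twice that of the control yields a value near 1 (a candidate gain), and a region with half the coverage yields a value near -1 (a candidate loss).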
The growing self-organizing map (GSOM) has been presented as an extended version of the self-organizing map (SOM), which has significant advantages for knowledge discovery applications. In this paper, the GSOM algorithm is presented in detail and the effect of a spread factor, which can be used to measure and control the spread of the GSOM, is investigated. The spread factor is independent of the dimensionality of the data and can therefore be used as a controlling measure for generating maps with different dimensionality, which can then be compared and analyzed with better accuracy. The spread factor is also presented as a method of achieving hierarchical clustering of a data set with the GSOM. Such hierarchical clustering allows the data analyst to identify significant and interesting clusters at a higher level of the hierarchy, and then continue with finer clustering of only the interesting clusters. Only a small map is therefore created in the beginning with a low spread factor, which can be generated for even a very large data set. Further analysis is conducted only on selected sections of the data, which are of smaller volume. This method thus facilitates the analysis of even very large data sets.
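The abstract above does not state how the spread factor is turned into a growth criterion. A minimal sketch, assuming the relation commonly given in the GSOM literature, GT = -D × ln(SF), where D is the data dimensionality and SF lies strictly between 0 and 1, is shown below. The function name `growth_threshold` is hypothetical.

```python
import math


def growth_threshold(dimension, spread_factor):
    """Growth threshold GT = -D * ln(SF) for a GSOM (assumed relation).

    A node's accumulated error must exceed GT before the map grows,
    so a lower spread factor gives a higher threshold and a smaller map.
    """
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    if not 0.0 < spread_factor < 1.0:
        raise ValueError("spread_factor must lie strictly between 0 and 1")
    return -dimension * math.log(spread_factor)
```

Under this assumption, a first pass with a low spread factor produces a coarse map, and a higher spread factor can then be applied to the selected clusters only, matching the hierarchical use described in the abstract.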
Quantitative gait assessment is important in diagnosis and management of Parkinson's disease (PD); however, gait characteristics of a cohort are dispersed by patient physical properties including age, height, body mass, and gender, as well as walking speed, which may limit capacity to discern some pathological features. The aim of this study was twofold: first, to use a multiple regression normalization strategy that accounts for subject age, height, body mass, gender, and self-selected walking speed to identify differences in spatial-temporal gait features between PD patients and controls; and second, to evaluate the effectiveness of machine learning strategies in classifying PD gait after gait normalization. Spatial-temporal gait data during self-selected walking were obtained from 23 PD patients and 26 age-matched controls. Data were normalized using standard dimensionless equations and multiple regression normalization. Machine learning strategies were then employed to classify PD gait using the raw gait data, data normalized using dimensionless equations, and data normalized using the multiple regression approach. After normalizing data using the dimensionless equations, only stride length, step length, and double support time were significantly different between PD patients and controls (p < 0.05); however, normalizing data using the multiple regression method revealed significant differences in stride length, cadence, stance time, and double support time. Random Forest resulted in a PD classification accuracy of 92.6% after normalizing gait data using the multiple regression approach, compared to 80.4% (support vector machine) and 86.2% (kernel Fisher discriminant) using raw data and data normalized using dimensionless equations, respectively. Our multiple regression normalization approach will assist in diagnosis and treatment of PD using spatial-temporal gait data.
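The abstract above does not spell out the normalization procedure. One plausible reading, sketched below in Python, fits a linear model of a gait feature on the covariates and then divides each observed value by the model's prediction. Both function names are hypothetical, and fitting on a reference group and normalizing by division are assumptions, not details confirmed by the abstract.

```python
import numpy as np


def fit_covariate_model(covariates, feature):
    """Least-squares fit of feature ~ intercept + covariates (assumed model)."""
    covariates = np.asarray(covariates, dtype=float)
    feature = np.asarray(feature, dtype=float)
    # Prepend a column of ones so the model has an intercept term.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, feature, rcond=None)
    return coef


def normalize_feature(covariates, feature, coef):
    """Divide each observed value by the value predicted from covariates."""
    covariates = np.asarray(covariates, dtype=float)
    feature = np.asarray(feature, dtype=float)
    design = np.column_stack([np.ones(len(covariates)), covariates])
    predicted = design @ coef
    return feature / predicted
```

With this scheme, a normalized value near 1 means the subject's gait feature matches what the covariates predict, while values well below or above 1 indicate a deviation that the covariates do not explain.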