Background: Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The method of diagnosis, the gonadotropin-releasing hormone (GnRH) stimulation test or the GnRH analogue (GnRHa) stimulation test, is expensive and uncomfortable for patients because it requires repeated blood sampling. Objective: We aimed to combine multiple CPP-related features and construct machine learning models to predict response to the GnRHa stimulation test. Methods: In this retrospective study, we analyzed clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for prediction of response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured the sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the models. Results: Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable models of LIME, the abovementioned variables made high contributions to the prediction probability. Conclusions: The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa stimulation test.
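The three reported metrics can be illustrated with a minimal Python sketch. It is not the authors' code: the labels, scores, and the 0.5 decision threshold below are made-up toy values, and the function names are hypothetical. Sensitivity and specificity come from the confusion matrix at a fixed threshold, while the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (ties count as one half).

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 1 = positive response, 0 = negative response.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed threshold

sens, spec = sensitivity_specificity(toy_labels, toy_preds)
auc = roc_auc(toy_labels, toy_scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc:.4f}")
```

Unlike sensitivity and specificity, the AUC does not depend on the chosen threshold, which is why the abstract reports it alongside the other two.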
Background: High-throughput technologies enable the testing of tens of thousands of measurements simultaneously. Identification of genes that are differentially expressed or associated with clinical outcomes invokes the multiple testing problem. False discovery rate (FDR) control is a statistical method used to correct for multiple comparisons for independent or weakly dependent test statistics. Although FDR control is frequently applied to microarray data analysis, gene expression is usually correlated, which might lead to inaccurate estimates. In this paper, we evaluate the accuracy of FDR estimation. Methods: Using two real data sets, we resampled subgroups of patients and recalculated statistics of interest to illustrate the imprecision of FDR estimation. Next, we generated many simulated data sets with block correlation structures and realistic noise parameters, using the Ultimate Microarray Prediction, Inference, and Reality Engine (UMPIRE) R package. We estimated FDR using a beta-uniform mixture (BUM) model, and examined the variation in FDR estimation. Results: The three major sources of variation in FDR estimation are the sample size, correlations among genes, and the true proportion of differentially expressed genes (DEGs). The sample size and proportion of DEGs affect both the magnitude and the precision of FDR estimation, while the correlation structure mainly affects the variation of the estimated parameters. Conclusions: We have decomposed various factors that affect FDR estimation, and illustrated the direction and extent of the impact. We found that the proportion of DEGs has a significant impact on FDR; this factor might have been overlooked in previous studies and deserves more thought when controlling FDR.
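As a rough illustration of how a BUM model yields an FDR estimate, the Python sketch below assumes the usual BUM formulation, in which p-values follow the density f(p) = λ + (1 − λ) a p^(a − 1) with 0 < a < 1 and 0 ≤ λ < 1. The parameter values are invented for the example, the function names are hypothetical, and the fitting step (estimating a and λ from observed p-values) is omitted. This is not the paper's code.

```python
def _check(a, lam):
    """Validate the assumed BUM parameter ranges."""
    if not (0.0 < a < 1.0 and 0.0 <= lam < 1.0):
        raise ValueError("expected 0 < a < 1 and 0 <= lam < 1")


def bum_cdf(tau, a, lam):
    """Fraction of p-values expected at or below tau under the mixture."""
    _check(a, lam)
    return lam * tau + (1.0 - lam) * tau ** a


def bum_null_upper_bound(a, lam):
    """Largest uniform (null) share consistent with the mixture: f(1)."""
    _check(a, lam)
    return lam + (1.0 - lam) * a


def bum_fdr(tau, a, lam):
    """Estimated FDR when calling every p-value <= tau significant."""
    return bum_null_upper_bound(a, lam) * tau / bum_cdf(tau, a, lam)


def bum_threshold(alpha, a, lam):
    """P-value cutoff whose estimated FDR equals alpha (capped at 1)."""
    pi_ub = bum_null_upper_bound(a, lam)
    ratio = (pi_ub - alpha * lam) / (alpha * (1.0 - lam))
    return min(1.0, ratio ** (1.0 / (a - 1.0)))


# Hypothetical fitted parameters, chosen only to make the arithmetic easy.
a_hat, lam_hat = 0.5, 0.5
print(bum_fdr(0.01, a_hat, lam_hat))        # estimated FDR at cutoff 0.01
print(bum_threshold(0.05, a_hat, lam_hat))  # cutoff for a 5% target FDR
```

Because the estimate depends on the fitted a and λ, any instability in those parameters (for example from small samples or correlated genes) carries straight through to the FDR, which is the sensitivity the abstract describes.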
Matching genes across microarray platforms is a critical step in meta-analysis. Standard practice uses UniGene to match genes. Numerous studies have found poor correlations between platforms when using UniGene matching. We profiled samples from 33 breast cancer patients on two different microarray platforms (Affymetrix and cDNA) and investigated gene matching. Our results confirmed that UniGene-based matching led to poor correlations of gene expression between platforms. Using RefSeq, a database maintained by the National Center for Biotechnology Information (NCBI), we developed and implemented a new method to refine gene matching. We found that the correlations between gene expression measurements were substantially higher after the RefSeq matching. Our approach differs from previously reported sequence-matching approaches and retains useful expression measurements. It is a sensible approach for matching probes across platforms. We conclude that UniGene alone is insufficient to match genes across platforms. Refined matching based on RefSeq significantly improves the quality of matches.
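The general idea of evaluating a probe matching can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' RefSeq-based method, whose details the abstract does not give: probes from two platforms are paired when they map to the same identifier, and each pair is scored by the Pearson correlation of its expression values across the same samples. All probe names, identifiers, and expression values are made up, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


def match_probes(map_a, map_b):
    """Pair probes from two platforms that share the same identifier."""
    by_id = {}
    for probe, ident in map_b.items():
        by_id.setdefault(ident, []).append(probe)
    return [
        (pa, pb)
        for pa, ident in map_a.items()
        for pb in by_id.get(ident, [])
    ]


def cross_platform_correlations(expr_a, expr_b, pairs):
    """Correlation across samples for every matched probe pair."""
    return {(pa, pb): pearson(expr_a[pa], expr_b[pb]) for pa, pb in pairs}


# Placeholder annotations (assumption): probe -> transcript identifier.
ids_a = {"affy_1": "ID_1", "affy_2": "ID_2", "affy_3": "ID_9"}
ids_b = {"cdna_1": "ID_1", "cdna_2": "ID_7", "cdna_3": "ID_2"}

# Placeholder expression values for the same four samples on each platform.
expr_a = {"affy_1": [1.0, 2.0, 3.0, 4.0], "affy_2": [1.0, 2.0, 3.0, 4.0]}
expr_b = {"cdna_1": [2.0, 4.0, 6.0, 8.0], "cdna_3": [4.0, 3.0, 2.0, 1.0]}

pairs = match_probes(ids_a, ids_b)
print(cross_platform_correlations(expr_a, expr_b, pairs))
```

Comparing the distribution of these per-pair correlations under two different annotation sources is one simple way to judge which matching is better.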