Motivation: DNA methylation is a biological process that affects gene function without changing the underlying DNA sequence. The DNA methylation machinery usually attaches methyl groups to specific cytosine residues, which modifies the chromatin architecture. Such modifications in promoter regions can inactivate some tumor-suppressor genes, and DNA methylation within the coding region may significantly reduce transcription elongation efficiency. Gene function may therefore be tuned through the methylation of some of its cytosines. Methods: This study hypothesizes that the overall methylation level across a gene may be more strongly associated with sample labels, such as disease status, than the methylation of individual cytosines. The gene methylation level is formulated as a regression model over the methylation levels of all the cytosines within the gene. Various feature selection and classification algorithms are evaluated on both the gene-level and the residue-level methylation features. Results: A comprehensive evaluation compared the gene and cytosine methylation levels in terms of their associations with the sample labels and their classification performance. Unsupervised clustering was also improved by using the gene methylation levels. Some genes demonstrated statistically significant associations with the class label even when none of their residue-level methylation features did. In summary, the trained gene methylation levels improved various methylome-based machine learning models. Both the methodological development of regression algorithms and the experimental validation of gene-level methylation biomarkers merit further investigation in future studies. The source code, example data files and manual are available at http://www.healthinformaticslab.org/supp/.
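The Methods sentence on formulating the gene methylation level as a regression over a gene's cytosines can be illustrated with a minimal sketch. The abstract does not specify the regression form, so the lightly regularised linear regression below is an assumption, and the function names (`fit_gene_methylation`, `gene_methylation_level`) and the simulated data are hypothetical, not taken from the authors' code.

```python
import numpy as np

def fit_gene_methylation(X, y, ridge=1e-6):
    """Fit a linear regression of the sample label on one gene's cytosines.

    X: (n_samples, n_cytosines) methylation levels in [0, 1] for one gene.
    y: (n_samples,) binary sample labels (0 or 1).
    ridge: small penalty for numerical stability (an assumption of this sketch).
    Returns (weights, intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                      # centre features so the intercept separates out
    A = Xc.T @ Xc + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def gene_methylation_level(X, w, b):
    """Collapse per-cytosine levels into one gene-level value per sample."""
    return np.asarray(X, dtype=float) @ w + b

# Simulated gene: 8 cytosines, each shifted only slightly by the label.
rng = np.random.default_rng(0)
n, m = 60, 8
y = np.repeat([0, 1], n // 2)
X = np.clip(0.5 + 0.05 * y[:, None] + rng.normal(0, 0.1, (n, m)), 0, 1)

w, b = fit_gene_methylation(X, y)
gene_level = gene_methylation_level(X, w, b)

# In-sample correlations with the label (illustrative only; a real
# evaluation would score held-out samples).
corr_gene = np.corrcoef(gene_level, y)[0, 1]
corr_best_single = max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(m))
```

Here the gene-level value pools weak signals from several cytosines, which is the intuition behind the hypothesis; the in-sample comparison overstates the benefit and is no substitute for cross-validation.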
Controlling for differences in total mRNA content between cell populations is critical in comparative transcriptomic measurements. Because ERCC spike-ins are poorly compatible with droplet-based scRNA-seq, a good control for this platform has yet to be found, and normalizing cells to a common count distribution has been adopted as a silent compromise. This practice profoundly confounds downstream analysis and misleads discoveries. We present TOMAS, a computational framework that derives total mRNA content ratios between cell populations by deconvoluting their heterotypic doublets. Experiments showed that total mRNA content can differ many-fold between cell types and that TOMAS can accurately infer the ratios between them. We demonstrate that TOMAS corrects bias in downstream analysis and rectifies a plethora of previously counter-intuitive or inconclusive analytical results. We argue against the opinion that doublets are undesired scale-limiting factors and reveal their unique value as controls in scRNA-seq. We advocate for their essential role in future large-scale scRNA-seq experiments.
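The idea of reading a total mRNA ratio out of heterotypic doublets can be sketched under a deliberately simplified model. This is not the TOMAS algorithm: it assumes a doublet's expression profile is a mixture of the two cell types' normalized profiles, weighted by their total mRNA, with multinomial sampling and no technical effects. The function name `estimate_mrna_ratio` and the simulated profiles are hypothetical.

```python
import numpy as np

def estimate_mrna_ratio(profile_a, profile_b, doublet_counts, grid=999):
    """Estimate R = total mRNA of type A / total mRNA of type B.

    Assumed model (a simplification for illustration): pooled counts from
    A+B doublets follow a multinomial with gene probabilities
    w * p_a + (1 - w) * p_b, where w = R / (1 + R).
    """
    p_a = np.asarray(profile_a, dtype=float)
    p_b = np.asarray(profile_b, dtype=float)
    p_a, p_b = p_a / p_a.sum(), p_b / p_b.sum()
    counts = np.asarray(doublet_counts, dtype=float)

    # Grid search over the mixing weight w for the maximum likelihood value.
    ws = np.linspace(0.001, 0.999, grid)
    mix = ws[:, None] * p_a + (1.0 - ws[:, None]) * p_b   # shape (grid, genes)
    loglik = (counts * np.log(mix + 1e-12)).sum(axis=1)
    w_hat = ws[np.argmax(loglik)]
    return w_hat / (1.0 - w_hat)

# Simulated example: type A carries 4 times the mRNA of type B.
rng = np.random.default_rng(1)
n_genes = 200
p_a = rng.dirichlet(np.ones(n_genes))
p_b = rng.dirichlet(np.ones(n_genes))
true_ratio = 4.0
w_true = true_ratio / (1.0 + true_ratio)
doublet_counts = rng.multinomial(200_000, w_true * p_a + (1.0 - w_true) * p_b)

ratio_hat = estimate_mrna_ratio(p_a, p_b, doublet_counts)
```

Normalizing every cell to a common count total would force this ratio to 1, which is the confounding effect the abstract describes; the doublet supplies the within-droplet reference that makes the ratio identifiable.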