COT: an efficient and accurate method for detecting marker genes among many subtypes

Lu, Yingzhou; Chao, Wu; Parker, Sarah J.; Cheng, Zuolin; Saylor, Georgia; Eyk, Jennifer E. Van; Yu, Guoqiang; Clarke, Robert; Herrington, David M.; Wang, Yue

doi:10.1093/bioadv/vbac037

Cited by 10 publications

(15 citation statements)

References 11 publications


“…We evaluated the performance of MGpI and eCOT in comparison with representative or standard peer methods 5,6. The evaluation does not include uniHM because it is a subjective visualization tool.…”

Section: Results (mentioning)

confidence: 99%

“…Ideally, a signature gene among molecularly distinct groups would be either uniquely expressed or silent in the group of interest but in no others 4. However, test statistics used by most existing methods do not satisfy exactly this signature definition and are theoretically prone to detecting imprecise signatures 5. Furthermore, while a typical heatmap design is visually effective, the common reference origin for expression measurements is altered by the classical standardization, with zero-expression replaced by floating negative values for different genes.…”

Section: Introduction (mentioning)

confidence: 99%

“…Here we present ABDS tool suite assembled specifically for analyzing biologically diverse samples. Open-source R package includes three fundamental and interrelated analytic tools, namely, mechanism-integrated group-wise pre-imputation (MGpI), extended cosine-based one-sample test (eCOT) 5, and unified heatmap design (uniHM). Collectively, we propose a hybrid imputation strategy to impute informative missingness associated with signature genes (SG), a cosine-score test to detect downregulated signature genes (DSG), and a unified heatmap design to comparably display multiple differential groups (Fig.…”

Section: Introduction (mentioning)

confidence: 99%


ABDS: a bioinformatics tool suite for analyzing biologically diverse samples

Du, Bhardwaj, et al. 2024

Preprint

Self Cite


Bioinformatics software tools are essential to identify informative molecular features that define different phenotypic sample groups. Among the most fundamental and interrelated tasks are missing value imputation, signature gene detection, and differential pattern visualization. However, many commonly used analytics tools can be problematic when handling biologically diverse samples if either the informative missingness has high missing rates with mixed missing mechanisms, or multiple sample groups are compared and visualized in parallel. We developed the ABDS tool suite specifically for analyzing biologically diverse samples. Collectively, a mechanism-integrated group-wise pre-imputation scheme is proposed to retain informative missingness associated with signature genes, a cosine-based one-sample test is extended to detect group-silenced signature genes, and a unified heatmap is designed to display multiple sample groups. We describe the methodological principles and demonstrate the effectiveness of the three analytics tools under targeted scenarios, supported by comparative evaluations and biomedical showcases. As an open-source R package, the ABDS tool suite complements rather than replaces existing tools and will allow biologists to more accurately detect interpretable molecular signals among phenotypically diverse sample groups.
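To make the idea of a cosine-based signature score concrete, the sketch below scores each gene by the cosine between its vector of group-mean expression values and an idealized reference pattern. It is an illustration only, not the ABDS or COT implementation: the function names `cosine_scores` and `signature_scores`, the use of group means, the one-hot reference for a uniquely expressed gene, and the complementary reference for a group-silenced gene are all assumptions made for this example, and the significance testing described in the abstract is not reproduced.

```python
import numpy as np


def cosine_scores(group_means, reference):
    """Cosine between each gene's cross-group mean vector (rows) and a reference pattern."""
    group_means = np.asarray(group_means, dtype=float)
    reference = np.asarray(reference, dtype=float)
    norms = np.linalg.norm(group_means, axis=1) * np.linalg.norm(reference)
    # Genes with all-zero expression get a score of 0 instead of NaN.
    return np.divide(
        group_means @ reference,
        norms,
        out=np.zeros(group_means.shape[0]),
        where=norms > 0,
    )


def signature_scores(group_means, group, silenced=False):
    """Score genes against an assumed ideal pattern for one group of interest.

    silenced=False: expressed only in `group` (one-hot reference).
    silenced=True:  silent only in `group` (ones everywhere except `group`).
    """
    n_groups = np.asarray(group_means).shape[1]
    reference = np.zeros(n_groups)
    reference[group] = 1.0
    if silenced:
        reference = 1.0 - reference
    return cosine_scores(group_means, reference)


# Hypothetical group means: rows are genes, columns are three sample groups.
means = np.array([
    [9.0, 0.0, 0.0],  # expressed only in group 0
    [0.0, 5.0, 5.0],  # silent only in group 0
    [4.0, 4.0, 4.0],  # no group specificity
])
up = signature_scores(means, group=0)
down = signature_scores(means, group=0, silenced=True)
```

Under these assumptions, a score near 1 means the gene closely matches the ideal pattern for the chosen group, so the first gene scores highest in `up` and the second scores highest in `down`.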


“…In order to protect health information and improve reproducibility in research, synthetic data has drawn mainstream attention in the healthcare industry [54,55]. Many labs and companies have harnessed the tools of big data and advanced computation tools to produce large quantities of synthetic data [56]. Modeled after patient data, synthetic data generation is essential to understanding diseases while maintaining patient confidentiality and privacy simultaneously [57].…”

Section: Healthcare (mentioning)

confidence: 99%

Machine Learning for Synthetic Data Generation: a Review

Lu, Wang, Wei 2023

Preprint


Data plays a crucial role in machine learning. However, real-world applications face several problems with data: the data may be of low quality; a limited number of data points leads to under-fitting of the machine learning model; and data can be hard to access due to privacy, safety, and regulatory concerns. Synthetic data generation offers a promising new avenue, as it can be shared and used in ways that real-world data cannot. This paper systematically reviews the existing works that leverage machine learning models for synthetic data generation. Specifically, we discuss the synthetic data generation works from several perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; (iii) privacy and fairness issues. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.


“…As a novel normalization-by-testing strategy (Evans et al., 2018), the Cosbin framework makes no assumption that the total expression is the same or that differential expression across differential experimental conditions is approximately symmetrical and thus complements rather than replaces existing methods (Evans et al., 2018; Johnson and Krishnan, 2022). One additional benefit of the Cosbin tool is the concurrent detection of marker genes (MGs) and iCEGs (Lu et al., 2022) (Supplementary Information).…”

Section: Introduction (mentioning)

confidence: 99%

Cosbin: cosine score-based iterative normalization of biologically diverse samples

Chao, Cheng, et al. 2022

Bioinformatics Advances

Self Cite


Motivation: Data normalization is essential to ensure accurate inference and comparability of gene expression measures across samples or conditions. Ideally, gene expression data should be rescaled based on consistently expressed reference genes. However, to normalize biologically diverse samples, most commonly used reference genes exhibit striking expression variability, and size-factor or distribution-based normalization methods can be problematic when the amount of asymmetry in differential expression is significant. Results: We report an efficient and accurate data-driven method, Cosine score-based iterative normalization (Cosbin), to normalize biologically diverse samples. Based on the Cosine scores of cross-condition expression patterns, the Cosbin pipeline iteratively eliminates asymmetric differentially expressed genes, identifies consistently expressed genes, and calculates sample-wise normalization factors. We demonstrate the superior performance and enhanced utility of Cosbin compared with six representative peer methods using both simulation and real multi-omics expression datasets. Implemented in open-source R scripts and specifically designed to address normalization bias due to significant asymmetry in differential expression across multiple conditions, the Cosbin tool complements rather than replaces the existing methods and will allow biologists to more accurately detect true molecular signals among diverse phenotypic groups. Availability: The R scripts of the Cosbin pipeline are freely available at https://github.com/MinjieSh/Cosbin. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
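As a rough illustration of how a cosine score can flag consistently expressed genes, the sketch below scores each gene by the cosine between its cross-condition mean vector and the all-ones vector, so that a gene with equal expression in every condition scores 1. This is a single scoring step under assumptions made for this example (the function name `consistency_scores`, the all-ones reference, and the toy data are hypothetical); it does not reproduce the iterative elimination or the normalization-factor calculation of the Cosbin pipeline.

```python
import numpy as np


def consistency_scores(condition_means):
    """Cosine between each gene's cross-condition mean vector and the all-ones vector."""
    x = np.asarray(condition_means, dtype=float)
    n_conditions = x.shape[1]
    # The dot product with the all-ones vector is the row sum; its norm is sqrt(n_conditions).
    norms = np.linalg.norm(x, axis=1) * np.sqrt(n_conditions)
    # Genes with all-zero expression get a score of 0 instead of NaN.
    return np.divide(x.sum(axis=1), norms, out=np.zeros(x.shape[0]), where=norms > 0)


# Hypothetical condition means: rows are genes, columns are three conditions.
means = np.array([
    [4.0, 4.0, 4.0],  # identical across conditions
    [9.0, 0.0, 0.0],  # expressed in a single condition
    [3.0, 4.0, 5.0],  # mildly variable
])
scores = consistency_scores(means)
ranking = np.argsort(-scores)  # gene indices, most consistent first
```

Genes at the top of `ranking` would be the candidates for reference genes in this simplified view; a full pipeline would re-score after each normalization pass.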
