2022
DOI: 10.48550/arxiv.2207.00554
Preprint

Inference after latent variable estimation for single-cell RNA sequencing data

Abstract: In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values and confidence intervals in the second step will fail to achieve statistical guarantees such as Type 1 error…

Cited by 4 publications
(6 citation statements)
References 18 publications
“…We first created a new open-source Python-based bootstrapping implementation of the count-splitting method [8], employing it to create three independent measures for training, validation, and test sets that host each cell and gene in every split. Note that this differs from standard machine learning (ML) practice in which separate samples (cells in scRNAseq) would be used for training, validation, and test, looking for replicated characteristics in different populations.…”
Section: Results (mentioning)
confidence: 99%
“…Here we present a count-splitting approach inspired by Neufeld et al. [8], that uses a training-set for topology building and clustering, a validation-set to optimize the training-clusters (or other topology-associated analysis), and a final test-set for DEG analysis. These splits enabled our creation of a new cluster validation algorithm and also allowed us to develop the first self-supervised benchmark of scRNAseq analysis pipelines.…”
Section: Discussion (mentioning)
confidence: 99%
“…To mitigate double-dipping, a Poisson/negative binomial sampling of the original data into independent partitions of the same dimensions was introduced in an approach called count splitting [Neufeld et al., 2022, 2023]. The first partition can be used for latent-variable estimation and the second partition for inference.…”
Section: Discussion (mentioning)
confidence: 99%
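
The two-partition idea in the statement above can be illustrated with a minimal sketch. It assumes the counts are Poisson distributed, in which case binomial thinning of each entry yields two matrices of the same dimensions that are independent of each other. The function name `count_split`, the parameter `eps`, and the simulated matrix are hypothetical and chosen only for illustration; this is not the cited authors' implementation, and the negative binomial variant mentioned in the statement is not covered.

```python
import numpy as np


def count_split(counts, eps=0.5, seed=0):
    """Split a cells-by-genes count matrix into two same-shaped matrices.

    Illustrative sketch under a Poisson assumption: each entry X is thinned as
    X_train ~ Binomial(X, eps) and X_test = X - X_train.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    train = rng.binomial(counts, eps)  # elementwise binomial thinning
    test = counts - train              # remainder forms the second partition
    return train, test


# Simulated data (hypothetical): 200 cells, 50 genes, Poisson mean 5.
sim_rng = np.random.default_rng(1)
X = sim_rng.poisson(lam=5.0, size=(200, 50))

X_train, X_test = count_split(X, eps=0.5)
# X_train would be used to estimate the latent variable (e.g. clusters),
# and X_test to test each gene for association with that estimate.
```

Both partitions keep every cell and every gene, which is what distinguishes this scheme from splitting the cells themselves into separate samples.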
“…To characterize previously unknown cell types, automatic selection of signature genes for each cluster is often achieved through differential expression (DE) testing [24]. For this task, BacSC provides capabilities for DE testing that takes the recently popularized problem of "double dipping" for DE testing of cell types into account [29,45,46]. In short, using the same information (gene expression) to define a clustering as well as to subsequently determine DE genes that characterize these clusters results in an inflated false discovery rate (FDR).…”
Section: Description of the BacSC Pipeline (mentioning)
confidence: 99%