2023
DOI: 10.1101/2023.03.05.530358
Preprint

LSTrAP-denovo: Automated Generation of Transcriptome Atlases for Eukaryotic Species Without Genomes

Abstract: Motivation: Despite the abundance of species with transcriptomic data, a significant number of the species still lack genomes, making it difficult to study gene function and expression in these organisms. While de novo transcriptome assembly can be used to assemble protein-coding transcripts from RNA-sequencing (RNA-seq) data, the datasets used often only feature samples of arbitrarily-selected or similar experimental conditions which might fail to capture condition-specific transcripts. Results: W…

Cited by 1 publication (2 citation statements)
References 61 publications
“…Transcripts per million (TPM) values of primary transcripts of RNA-seq samples estimated by Kallisto were used as gene expression abundances to construct gene expression matrices for the different species. RNA-seq samples with poor read alignment were identified based on their %_pseudoaligned statistic (reported by Kallisto) falling below a species-specific threshold (5% for H. sapiens; 20% for S. cerevisiae and A. thaliana) as described [90][91][92] (Tables S10 to S12). After removal of RNA-seq samples with poor read alignment, gene expression matrices comprising gene expression abundances from 150,096 (79.8%), 57,759 (87.5%), and 95,553 (66.9%) retained RNA-seq samples were designated as the public transcriptomic datasets for S. cerevisiae, A. thaliana and H. sapiens, respectively.…”
Section: Generation and Preprocessing Of S Cerevisiae A Thaliana And ...
Citation type: mentioning
confidence: 99%
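
The filtering step quoted above can be sketched in a few lines. This is only an illustration, not the citing authors' code: it assumes the per-sample pseudoalignment percentage has already been collected into a mapping, the function name `filter_samples` and the sample IDs are hypothetical, and treating a sample exactly at the threshold as retained is an assumption (the statement only says samples below the threshold were removed).

```python
# Thresholds quoted in the citation statement (percent of reads pseudoaligned).
THRESHOLDS = {"H. sapiens": 5.0, "S. cerevisiae": 20.0, "A. thaliana": 20.0}


def filter_samples(pct_pseudoaligned, species):
    """Return IDs of samples whose pseudoalignment rate is not below the threshold.

    pct_pseudoaligned: dict mapping sample ID -> percent of reads pseudoaligned.
    species: key into THRESHOLDS.
    """
    threshold = THRESHOLDS[species]
    # Samples strictly below the threshold are treated as poorly aligned and dropped.
    return [sample for sample, pct in pct_pseudoaligned.items() if pct >= threshold]


# Hypothetical sample IDs and percentages, for illustration only.
example = {"sample_A": 3.2, "sample_B": 5.0, "sample_C": 87.4}
retained = filter_samples(example, "H. sapiens")
```

Only the columns of the expression matrix that correspond to the retained sample IDs would then be kept.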
“…TPM values of each RNA-seq sample in the gene expression matrices were standardized to unit variance (within-sample standardization) to control for batch effects at the sample level as described [33,90]. Next, the standardized gene expression values were subjected to dimension reduction using Principal Component Analysis (PCA; decomposition.IncrementalPCA function from scikit-learn version 1.3.0), where only the first 1,000 principal components (PCs) were used as high-level features to embed RNA-seq samples in PC space (PC embeddings) that describes their transcriptome similarity relative to each other.…”
Section: Dataset Partitioning
Citation type: mentioning
confidence: 99%
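
A minimal sketch of the two steps described in this statement, using the scikit-learn class it names. Several details are assumptions rather than anything the excerpt confirms: samples are stored as rows, "unit variance" is implemented as scaling only (no mean-centering before PCA's own centering), and the helper names, matrix size, and the small number of components are hypothetical stand-ins for the 1,000 PCs reported.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def standardize_within_sample(tpm):
    """Scale each sample (row) so its values across genes have unit variance."""
    std = tpm.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # leave all-constant samples unchanged instead of dividing by zero
    return tpm / std


def embed_samples(tpm, n_components, batch_size):
    """Project within-sample standardized profiles onto the leading principal components."""
    scaled = standardize_within_sample(tpm)
    # batch_size must be at least n_components for IncrementalPCA.
    pca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
    return pca.fit_transform(scaled)


# Synthetic stand-in for a samples x genes TPM matrix (illustration only).
rng = np.random.default_rng(0)
tpm = rng.gamma(shape=2.0, scale=10.0, size=(200, 50))
embeddings = embed_samples(tpm, n_components=10, batch_size=50)
```

IncrementalPCA fits the decomposition in mini-batches, which is presumably why it suits matrices with tens of thousands of samples that may not fit in memory at once.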