2018
DOI: 10.1093/bioinformatics/bty175

Understanding sequencing data as compositions: an outlook and review

Abstract: Motivation: Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclide…

Citations: cited by 276 publications (237 citation statements)
References: 64 publications
“…Most methods for analyzing RNA-Seq expression data assume that raw read counts represent absolute abundances (Quinn, Richardson, Lovell, & Crowley, 2017). However, RNA-Seq count data are not absolute and instead represent relative abundances as a type of compositional count data (Quinn, Erb, Richardson, & Crowley, 2018c; Quinn, Richardson, et al, 2017). Using methods that assume absolute values is invalid for compositional data (without first including a transformation) because the total number of reads (library size) generated from each sample varies based on factors such as sequencing performance, making comparisons of the actual count values between samples difficult (Fernandes et al, 2014; Quinn, Erb, et al, 2018c).…”
Section: Count Filtering and Log-ratio Transformations (mentioning)
confidence: 99%
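
The log-ratio transformation mentioned in this statement can be illustrated with a minimal sketch of the centered log-ratio (CLR), one common choice. The function name `clr`, the `pseudocount` parameter, and the toy counts are assumptions made for illustration; they do not come from the cited papers, and real pipelines handle zero counts more carefully.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio of one sample: log of each part minus the mean log.

    `pseudocount` is a hypothetical, simplistic guard against log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Two toy samples with the same relative abundances but a 10x
# difference in library size (values invented for illustration).
shallow = [10, 20, 70]
deep = [100, 200, 700]

# With no zeros present the pseudocount can be dropped, and the
# transformed values no longer depend on library size.
clr_shallow = clr(shallow, pseudocount=0.0)
clr_deep = clr(deep, pseudocount=0.0)
```

Because the CLR subtracts the mean log of each sample, multiplying every count in a sample by the same factor leaves the result unchanged, which is why it removes the arbitrary library size.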
“…Alternatively, compositional data analysis as a well‐developed body of statistical methodology provides models and methods equivalent to traditional ones yet accounts for these special constraining features of relative data. The approach has been used for decades to analyze analogous types of data in the geosciences (Buccianti et al, ) and, more recently, in other disparate areas such as molecular biology to analyze sequencing data (Quinn et al, ) or physical activity epidemiology for the analysis of daily time‐use patterns (Chastin et al, ; McGregor et al, ). While the statistical theory may be unfamiliar and not typically taught in most statistics courses, recent publications and software have made the use of these techniques both feasible and accessible.…”
Section: Results (mentioning)
confidence: 99%
“…There are many problems associated with the analysis of compositional data that cannot be handled by DMM alone (see Aitchison & Egozcue, , Gloor & Reid, , Quinn, Erb, Richardson, & Crowley, , Tsilimigras & Fodor, , van den Boogaart & Tolosana‐Delgado, ). The most intuitive challenge posed by compositional data is that spurious correlations among features can arise because of the data's inherent covariance structure (Pearson, ).…”
Section: Discussion (mentioning)
confidence: 99%
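
The spurious-correlation problem named in this statement can be seen in its most extreme form with a two-part composition. The sketch below is an illustration under assumed toy data, not an analysis from any of the cited works; the helper `pearson` and the abundance values are hypothetical.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented absolute abundances of two features across five samples.
# They rise together, so their absolute correlation is strongly positive.
a = [10, 25, 40, 55, 70]
b = [12, 30, 38, 60, 66]

# Closure: convert each sample to proportions that sum to 1.
totals = [ai + bi for ai, bi in zip(a, b)]
prop_a = [ai / t for ai, t in zip(a, totals)]
prop_b = [bi / t for bi, t in zip(b, totals)]

corr_absolute = pearson(a, b)
corr_relative = pearson(prop_a, prop_b)
```

Since `prop_b` equals `1 - prop_a` in every sample, the correlation between the two proportions is forced to -1 regardless of how the absolute abundances co-vary. With more than two parts the constraint is weaker but still biases correlations in the same way.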