2008
DOI: 10.1126/science.1151532

Alignment Uncertainty and Genomic Analysis

Abstract: The statistical methods applied to the analysis of genomic data do not account for uncertainty in the sequence alignment. Indeed, the alignment is treated as an observation, and all of the subsequent inferences depend on the alignment being correct. This may not have been too problematic for many phylogenetic studies, in which the gene is carefully chosen for, among other things, ease of alignment. However, in a comparative genomics study, the same statistical methods are applied repeatedly on thousands of gen…

Cited by 365 publications (300 citation statements)
References 23 publications
“…We used Spearman's rank correlation to determine if the following characteristics of the sequence data were correlated with the P values from the LRTs: (i) average GC content at the third position; (ii) average overall GC content; (iii) transition/transversion ratio (kappa); and (iv) dN tree length. The gappiness of an alignment could introduce potential biases in our results (17,18), so we also looked for correlations between the P values from the LRTs and two metrics to assess coverage in our alignments: (i) gap percent (gapPCT), or the sum of the number of gaps in each sequence in an alignment divided by the sum of the total number of sites in all of the sequences in an alignment; and (ii) an alignment quality score (described in SI Text). Only a few of these characteristics of the data were significantly correlated (P < 0.05) with the P values of the LRTs, but all correlations were very weak (range of Spearman's rho = −0.1 to 0.06, for all tests; Dataset S2).…”
Section: Heterogeneous Patterns Of Molecular Evolution Among Bee Line
Citation type: mentioning
confidence: 99%
“…Accurate multiple sequence alignment is a fundamental step in recovering a reliable phylogeny (Mullan 2002; Wong et al 2008). In theory, the order in which residues are aligned (i.e., amino-to-carboxy or carboxy-to-amino direction) should yield identical sequence alignments.…”
Section: Effect Of Alignment Orientation On Phylogenetic Supertree Re
Citation type: mentioning
confidence: 99%
“…The interaction between MSA and the accuracy of phylogenetic inference continues to be a major source of bias and uncertainty in phylogenetic and phylogenomic studies (Wong et al 2008; Hossain et al … The improved accuracy in pairwise distance estimates carries through to tree inference, with the trees calculated from our pair-HMM inferred distance matrix being more accurate than standard pairwise methods the overwhelming majority of the time, especially at larger evolutionary distances. For closely related sequences, this improvement may be attributable to incorporating indel information, whereas for more distantly related sequences the incorporation of alignment uncertainty is also very important.…”
Section: Discussion
Citation type: mentioning
confidence: 97%