Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer

Zhang, Qian; Jun, Se‐Ran; Leuze, Michael R.; Ussery, David W.; Nookaew, Intawat

doi:10.1038/srep40712

“…The choice of k-mer length is important. Increasing the k-mer size could decrease sensitivity in our case, as a small variation will significantly change the k-mer composition, whereas lowering the k-mer size reduces the features that are discriminative for a pattern [70]. In addition, our embedding size grows exponentially with respect to k, so there is also a practical upper bound on k. Following Zhang [70] and Dubinkina [71], we trained and tested in the range 4 ≤ k < 9.…”

Section: Genotyping in adVNTR-NN

mentioning

confidence: 95%
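The trade-off described in the statement above can be made concrete with a small sketch. It assumes a plain count-vector representation over the DNA alphabet {A, C, G, T}, so the number of possible features is 4^k; the function names `kmer_counts` and `feature_space_size` are hypothetical and are not taken from the cited work, whose embedding may be built differently.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers in a DNA sequence.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= alphabet:
            counts[word] += 1
    return counts


def feature_space_size(k: int) -> int:
    """Number of distinct DNA k-mers, i.e. the length of a full count vector."""
    return 4 ** k


if __name__ == "__main__":
    # The range quoted in the statement: 4 <= k < 9.
    for k in range(4, 9):
        print(f"k={k}: {feature_space_size(k)} possible k-mers")
```

Each step in k multiplies the vector length by four, which is the practical upper bound the statement refers to. A single substitution can also alter up to k overlapping k-mers, which is one way to read the sensitivity remark.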

Variable Number Tandem Repeats mediate the expression of proximal genes

Bakhtiari, Park, Ding et al. 2020

Preprint


Variable Number Tandem Repeats (VNTRs) account for a significant amount of human genetic variation. VNTRs have been implicated in both Mendelian and complex disorders, but are largely ignored by whole genome analysis pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks for fast read recruitment. On 55X whole genome data, adVNTR-NN genotyped each VNTR in less than 18 cpu-seconds, while maintaining 100% accuracy on 76% of VNTRs. We used adVNTR-NN to genotype 10,264 VNTRs in 652 individuals from the GTEx project and associated VNTR length with gene expression in 46 tissues. We identified 163 'eVNTR' loci that were significantly associated with gene expression. Of the 22 eVNTRs in blood where independent data was available, 21 (95%) were replicated in terms of significance and direction of association. 49% of the eVNTR loci showed a strong and likely causal impact on the expression of genes and 80% had maximum effect size at least 0.3. The impacted genes have important roles in complex phenotypes including Alzheimer's, obesity and familial cancers. Our results point to the importance of studying VNTRs for understanding the genetic basis of complex diseases.


“…In Fig 2, F(k) is plotted against the word length k for contiguous words. By inserting the average sequence length L into (14) and (15), we obtain k_min = 19 and k_max = 24. With these values, we can calculate the slope of F in the relevant range as…”

Section: Methods

mentioning

confidence: 99%
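The statement above breaks off before the slope formula. A two-point estimate from F(k_min) and F(k_max) is simply the difference quotient, sketched below. The conversion to a distance rests on an assumption that this page does not confirm: that the slope approximates ln(p), where p is the per-position match probability. The standard Jukes-Cantor correction is then applied to p. The function names and the F values in the usage example are hypothetical.

```python
import math


def two_point_slope(f_kmin: float, f_kmax: float, k_min: int, k_max: int) -> float:
    """Slope of F between k_min and k_max, estimated from the two end values."""
    if k_max <= k_min:
        raise ValueError("k_max must be larger than k_min")
    return (f_kmax - f_kmin) / (k_max - k_min)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance (substitutions per site) from match probability p."""
    arg = 1.0 - (4.0 / 3.0) * (1.0 - p)
    if arg <= 0.0:
        raise ValueError("match probability too low for the Jukes-Cantor model")
    return -0.75 * math.log(arg)


if __name__ == "__main__":
    # k_min = 19 and k_max = 24 are the values quoted above;
    # the two F values are made up for illustration only.
    slope = two_point_slope(f_kmin=10.0, f_kmax=9.0, k_min=19, k_max=24)
    p = math.exp(slope)  # ASSUMPTION: slope approximates ln(p)
    print(slope, p, jukes_cantor_distance(p))
```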

“…Therefore, considerable efforts have been made in recent years to develop fast alignment-free approaches that can estimate phylogenetic distances without the need to calculate full alignments of the input sequences, see [3][4][5][6][7] for recent review articles. Alignment-free approaches are not only used in phylogeny reconstruction, but are also important in metagenomics [8][9][10], to find genome rearrangements [11] and in epidemiology [12] and other medical applications, for example to identify drug-resistant bacteria [13] or to classify viruses [14,15]. In all these applications, it is crucial to rapidly estimate pairwise similarity or dissimilarity values in large sets of sequence data. Some alignment-free approaches are based on word frequencies [16,17] or on the length of common substrings [18][19][20].…”

mentioning

confidence: 99%

The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances

Röhling, Linne, Schellhorn et al. 2020


We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences, i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor, can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.

Citation: Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B (2020) The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS ONE 15(2): e0228070. https://doi.org/10.

Data Availability Statement: The source code of our software is freely available through GitHub

Traditionally, phylogenetic distances are inferred from pairwise or multiple sequence alignments. For the huge amounts of sequence data that are now available, however, sequence alignment has become too slow. Therefore, considerable efforts have been made in recent years to develop fast alignment-free approaches that can estimate phylogenetic distances without the need to calculate full alignments of the input sequences, see [3][4][5][6][7] for recent review articles. Alignment-free approaches are not only used in phylogeny reconstruction, but are also important in metagenomics [8][9][10], to find genome rearrangements [11] and in epidemiology [12] and other medical applications, for example to identify drug-resistant bacteria [13] or to classify viruses [14,15]. In all these applications, it is crucial to rapidly estimate pairwise similarity or dissimilarity values in large sets of sequence data.

Some alignment-free approaches are based on word frequencies [16,17] or on the length of common substrings [18][19][20]. Other methods use variants of the D_2 distance, which is defined as the number of word matches of a pre-defined length between two sequences [15][21][22][23]; a review focusing on these methods is given in [24]. kWIP [25] is a further development of this concept that uses information-theoretical...
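As a minimal illustration of the quantity N_k (and of the plain D_2 word-match count mentioned above), the sketch below counts all pairs of positions at which two sequences share the same length-k word. It is not the Slope-SpaM implementation; the name `count_word_matches` is hypothetical, and reverse complements and ambiguous bases are ignored for simplicity.

```python
from collections import Counter


def count_word_matches(s1: str, s2: str, k: int) -> int:
    """Number of position pairs (i, j) with s1[i:i+k] == s2[j:j+k]."""
    words1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    words2 = Counter(s2[j:j + k] for j in range(len(s2) - k + 1))
    # A word seen a times in s1 and b times in s2 contributes a * b matches.
    return sum(n * words2[word] for word, n in words1.items())


if __name__ == "__main__":
    a = "ACGTACGTTGCA"
    b = "ACGTTGCAACGT"
    for k in (2, 3, 4, 5):
        print(k, count_word_matches(a, b, k))
```

Evaluating this count over a range of k gives the raw values from which a function such as F would be derived; the definition of F itself is not reproduced on this page.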


“…Therefore, considerable efforts have been made in recent years to develop fast alignment-free approaches that can estimate phylogenetic distances without the need to calculate full alignments of the input sequences, see [20,50,57,4,26] for recent review articles. Alignment-free approaches are not only used in phylogeny reconstruction, but are also important in metagenomics [10,39,32] and in medical applications, for example to identify drug-resistant bacteria [5] or to classify viruses [55,2]. In all these applications, it is crucial to rapidly estimate pairwise similarity or dissimilarity values in large sets of sequence data.…”

Section: Introduction

mentioning

confidence: 99%

The number of spaced-word matches between two DNA sequences as a function of the underlying pattern weight

Röhling, Dencker, Morgenstern 2019

Preprint


We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences, i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor, can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called spaced-word matches, where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
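The idea of a spaced-word match can be sketched as follows, assuming the common convention that the binary pattern is written as a string in which '1' marks a position that must match and '0' marks a don't-care position. The helper names are hypothetical and this is not the Slope-SpaM code.

```python
from collections import Counter


def spaced_word(seq: str, start: int, pattern: str) -> str:
    """Characters of seq at the '1' positions of pattern, anchored at start."""
    return "".join(seq[start + i] for i, bit in enumerate(pattern) if bit == "1")


def count_spaced_word_matches(s1: str, s2: str, pattern: str) -> int:
    """Number of position pairs whose spaced words agree under pattern."""
    span = len(pattern)
    words1 = Counter(spaced_word(s1, i, pattern) for i in range(len(s1) - span + 1))
    words2 = Counter(spaced_word(s2, j, pattern) for j in range(len(s2) - span + 1))
    return sum(n * words2[word] for word, n in words1.items())


if __name__ == "__main__":
    # The middle position differs, so only the spaced pattern reports a match.
    print(count_spaced_word_matches("ACG", "ATG", "101"))
    print(count_spaced_word_matches("ACG", "ATG", "111"))
```

A pattern consisting only of '1' characters reduces to contiguous k-mer matching, and the number of '1' positions is the pattern weight referred to in the title.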


Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer

Cited by 46 publications

References 47 publications
