2017
DOI: 10.1038/srep40712

Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer

Abstract: The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral “tree of life”. However, due to the lack of evolutionary conservation amon…

citations
Cited by 46 publications
(60 citation statements)
references
References 47 publications
“…The choice of k-mer length is important. Increasing the k-mer size could decrease sensitivity in our case as small variation will significantly change the k-mer composition, whereas lowering k-mer size reduces the features that are discriminative for a pattern [70]. In addition, our embedding size exponentially grows with respect to the k so there is also a practical upper bound on the k. Following Zhang [70] and Dubinkina [71], we trained and tested in the range 4 ≤ k < 9.…”
Section: Genotyping In Advntr-nn
mentioning
confidence: 95%
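The trade-off described in the statement above can be illustrated with a minimal sketch. It is not taken from either paper: the function names and the toy sequence are hypothetical, and a four-letter DNA alphabet is assumed. It shows how the number of possible k-mers grows as 4^k over the quoted range 4 ≤ k < 9, and how a single substitution disturbs more k-mer occurrences as k increases.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers (words of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def feature_space_size(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct k-mers over the alphabet (4**k for DNA)."""
    return alphabet_size ** k


def kmers_lost_by_substitution(seq: str, pos: int, new_base: str, k: int) -> int:
    """Number of k-mer occurrences in seq that vanish after one substitution."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    # Counter subtraction keeps only positive differences, i.e. the
    # occurrences present before the change but missing afterwards.
    lost = kmer_counts(seq, k) - kmer_counts(mutated, k)
    return sum(lost.values())


if __name__ == "__main__":
    toy = "AACCGGTTAC"  # hypothetical toy sequence, not real data
    for k in range(4, 9):  # the range 4 <= k < 9 quoted above
        print(k, feature_space_size(k), kmers_lost_by_substitution(toy, 5, "A", k))
```

A substitution can affect at most k overlapping windows, so larger k makes the k-mer profile more sensitive to small variation, while the feature space grows by a factor of four with each increment of k.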
“…In Fig 2, F(k) is plotted against the word length k for contiguous words. By inserting the average sequence length L into (14) and (15), we obtain k_min = 19 and k_max = 24. With these values, we can calculate the slope of F in the relevant range as…”
Section: Methods
mentioning
confidence: 99%
“…Therefore, considerable efforts have been made in recent years, to develop fast alignment-free approaches that can estimate phylogenetic distances without the need to calculate full alignments of the input sequences, see [3][4][5][6][7] for recent review articles. Alignment-free approaches are not only used in phylogeny reconstruction, but are also important in metagenomics [8][9][10], to find genome rearrangements [11] and in epidemiology [12] and other medical applications, for example to identify drug-resistant bacteria [13] or to classify viruses [14,15]. In all these applications, it is crucial to rapidly estimate pairwise similarity or dissimilarity values in large sets of sequence data. Some alignment-free approaches are based on word frequencies [16,17] or on the length of common substrings [18][19][20].…”
mentioning
confidence: 99%
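As a rough illustration of the word-frequency family of alignment-free approaches mentioned in the statement above, the sketch below computes a pairwise dissimilarity from normalized k-mer frequencies. It is a generic example, not the method of any cited paper: the Euclidean distance, the function names, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of each overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return {word: n / total for word, n in counts.items()}


def frequency_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Euclidean distance between two k-mer frequency profiles."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    words = fa.keys() | fb.keys()  # words absent from a profile count as 0
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


def distance_matrix(seqs: list, k: int) -> list:
    """Symmetric matrix of pairwise distances, computed without any alignment."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = frequency_distance(seqs[i], seqs[j], k)
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTTCGTAC", "TTTTGGGGCC"]  # hypothetical sequences
    for row in distance_matrix(toy, 3):
        print(row)
```

Each profile is built in time linear in the sequence length, which is why this kind of comparison scales to large sequence sets more easily than computing full alignments.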
“…Therefore, considerable efforts have been made in recent years, to develop fast alignment-free approaches that can estimate phylogenetic distances without the need to calculate full alignments of the input sequences, see [20,50,57,4,26] for recent review articles. Alignment-free approaches are not only used in phylogeny reconstruction, but are also important in metagenomics [10,39,32] and in medical applications, for example to identify drug-resistant bacteria [5] or to classify viruses [55,2]. In all these applications, it is crucial to rapidly estimate pairwise similarity or dissimilarity values in large sets of sequence data.…”
Section: Introduction
mentioning
confidence: 99%