KMC 3: counting and manipulating k-mer statistics

Kokot, Marek; Dlugosz, Maciej; Deorowicz, Sebastian

doi:10.1093/bioinformatics/btx304

Cited by 524 publications

(450 citation statements)

References 9 publications

Citation types: Supporting, Mentioning (449), Contrasting


“…HCoV-NL63 genotype A and B sequence sets were prepared from GenBank plus the Kilifi HCoV-NL63 sequences. KMC3 [41] was used to identify all 30-nt sequences (k-mers) present in genotype A sequences and not in genotype B sequences and vice versa. Quality-controlled short read sequences from each sample were then classified as HCoV-NL63 genotype A or genotype B based on the read's content of genotype A and B-specific 30-nt kmers using a threshold of 20 kmer per read as defining identity to a genotype.…”

Section: K-mer Methods Of Genotype Classification (mentioning)

confidence: 99%

Human Coronavirus NL63 Molecular Epidemiology and Evolutionary Patterns in Rural Coastal Kenya

Kiyuka, Agoti, Munywoki, et al. 2018

The Journal of Infectious Diseases


Background: Human coronavirus NL63 (HCoV-NL63) is a globally endemic pathogen causing mild and severe respiratory tract infections with reinfections occurring repeatedly throughout a lifetime. Methods: Nasal samples were collected in coastal Kenya through community-based and hospital-based surveillance. HCoV-NL63 was detected with multiplex real-time reverse transcription PCR, and positive samples were targeted for nucleotide sequencing of the spike (S) protein. Additionally, paired samples from 25 individuals with evidence of repeat HCoV-NL63 infection were selected for whole-genome virus sequencing. Results: HCoV-NL63 was detected in 1.3% (75/5573) of child pneumonia admissions. Two HCoV-NL63 genotypes circulated in Kilifi between 2008 and 2014. Full genome sequences formed a monophyletic clade closely related to contemporary HCoV-NL63 from other global locations. An unexpected pattern of repeat infections was observed with some individuals showing higher viral titers during their second infection. Similar patterns for 2 other endemic coronaviruses, HCoV-229E and HCoV-OC43, were observed. Repeat infections by HCoV-NL63 were not accompanied by detectable genotype switching. Conclusions: In this coastal Kenya setting, HCoV-NL63 exhibited low prevalence in hospital pediatric pneumonia admissions. Clade persistence with low genetic diversity suggests limited immune selection, and absence of detectable clade switching in reinfections indicates initial exposure was insufficient to elicit a protective immune response.
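The classification step in the citation statement above (30-nt k-mers specific to one genotype, with a threshold of 20 k-mers per read) can be illustrated with a minimal Python sketch. It is not KMC's interface or the authors' pipeline: the function names `kmers`, `specific_kmers` and `classify_read` are hypothetical, reverse complements are ignored, and the tie-breaking rule is an assumption, since the statement does not describe one. The toy sequences use k = 3 and a threshold of 2 rather than the 30 and 20 quoted above.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_kmers(group, other, k):
    """k-mers present in at least one sequence of `group` and in none of `other`."""
    in_group = set().union(*(kmers(s, k) for s in group))
    in_other = set().union(*(kmers(s, k) for s in other))
    return in_group - in_other


def classify_read(read, a_only, b_only, k, threshold):
    """Assign a read to genotype A or B from its genotype-specific k-mer hits."""
    windows = [read[i:i + k] for i in range(len(read) - k + 1)]
    hits_a = sum(w in a_only for w in windows)
    hits_b = sum(w in b_only for w in windows)
    # Assumption: a read needs at least `threshold` hits and a strict majority.
    if hits_a >= threshold and hits_a > hits_b:
        return "A"
    if hits_b >= threshold and hits_b > hits_a:
        return "B"
    return "unclassified"


# Toy data (hypothetical sequences, not HCoV-NL63).
genotype_a = ["ACGTAC"]
genotype_b = ["ACGGAC"]
a_only = specific_kmers(genotype_a, genotype_b, k=3)
b_only = specific_kmers(genotype_b, genotype_a, k=3)
labels = [classify_read(r, a_only, b_only, k=3, threshold=2)
          for r in ["CGTAC", "GGAC", "ACG"]]
```

On real data the set difference would be computed by a k-mer counter rather than in-memory Python sets; the sketch only shows the logic of the comparison and the per-read threshold.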


“…The VCFs with these variants were then normalized using bcftools norm (1.9) and combined with the SVs across samples using bayesTyperTools combine to produce the input candidate set. k-mers in the raw reads were counted using kmc [41] (3.1.1) with a k-mer size of 55. A Bloom filter was constructed from these k-mers using bayesTyperTools makeBloom.…”

Section: Bayestyper (V15 Beta 62888d6) (mentioning)

confidence: 99%

Genotyping structural variants in pangenome graphs using the vg toolkit

Hickey, Heller, Monlong, et al. 2019

Preprint


Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmarked vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format. …real Illumina reads and a pangenome built from SVs discovered in recent long-read sequencing studies [21,22,23,5]. We also compared vg's performance with state-of-the-art SV genotypers: SVTyper [3], Delly Genotyper [4], BayesTyper [19], Paragraph [20] and . Across the datasets we tested, which range in size from 26k to 97k SVs, vg is the best performing SV genotyper on real short-read data for all SV types in the majority of cases. Finally, we demonstrate that a pangenome graph built from the alignment of de novo assemblies of diverse Saccharomyces cerevisiae strains improves SV genotyping performance. Results: Structural variation in vg. We used vg to implement a straightforward SV genotyping pipeline. Reads are mapped to the graph and used to compute the read support for each node and edge (see Supplementary Information for a description of the graph formalism). Sites of variation within the graph are then identified using the snarl decomposition as described in [24]. These sites correspond to intervals along the reference paths (e.g. contigs or chromosomes) which are embedded in the graph. They also contain nodes and edges deviating from the reference path, which represent variation at the site.
For each site, the two most supported paths spanning its interval (haplotypes) are determined, and their relative supports used to produce a genotype at that site (Figure 1a). The pipeline is described in detail in Methods. We rigorously evaluated the accuracy of our method on a variety of datasets, and present these results in the remainder of this section.
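The citation statement above describes counting k-mers in raw reads and then building a Bloom filter from them. The following Python sketch shows that two-step idea under stated assumptions: it does not reproduce kmc or bayesTyperTools makeBloom, the names `count_kmers` and `BloomFilter` are hypothetical, the filter size and number of hash functions are arbitrary, and the toy data uses k = 4 instead of the k-mer size of 55 quoted above.

```python
import hashlib
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives,
    but never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Toy read (hypothetical); real input would be sequencing reads.
counts = count_kmers(["ACGTACGT"], k=4)
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in counts:
    bloom.add(kmer)
```

The Bloom filter trades exactness for space: a query for a k-mer that was never added can occasionally return true, which is why the sketch only guarantees that every inserted k-mer is found.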


“…The compression of k-mer sets has not been extensively studied, except in the context of how k-mer counters store their output [17][18][19][20]. DSK [18] uses an HDF5-based encoding, KMC3 [17] combines a dense storage of prefixes with a sparse storage of suffixes, and Squeakr [20] uses a counting quotient filter [21]. The compression of read data, on the other hand, stored in either unaligned or aligned formats, has received a lot of attention [22][23][24].…”

Section: Related Work (mentioning)

confidence: 99%
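The statement above says that KMC3 combines a dense storage of prefixes with a sparse storage of suffixes. A rough Python sketch of that general idea follows; it is an illustration only and not KMC's actual file format. The names `build_index` and `contains` and the choice of `prefix_len` are hypothetical, counts are omitted, and only the characters A, C, G and T are assumed to occur.

```python
from bisect import bisect_left
from itertools import product


def build_index(kmers, prefix_len):
    """Split sorted k-mers into a dense prefix table and a list of suffixes."""
    # Every possible prefix gets a slot, whether or not it occurs (dense part).
    prefixes = ["".join(p) for p in product("ACGT", repeat=prefix_len)]
    rank = {p: i for i, p in enumerate(prefixes)}
    ordered = sorted(set(kmers))
    offsets = [0] * (len(prefixes) + 1)
    for kmer in ordered:
        offsets[rank[kmer[:prefix_len]] + 1] += 1
    # Running sum: offsets[i] is where the suffixes of prefix i start.
    for i in range(len(prefixes)):
        offsets[i + 1] += offsets[i]
    # Only suffixes of k-mers that actually occur are stored (sparse part).
    suffixes = [kmer[prefix_len:] for kmer in ordered]
    return rank, offsets, suffixes


def contains(index, kmer, prefix_len):
    """Look up a k-mer: jump to its prefix bucket, then binary-search the suffix."""
    rank, offsets, suffixes = index
    bucket = rank[kmer[:prefix_len]]
    lo, hi = offsets[bucket], offsets[bucket + 1]
    pos = bisect_left(suffixes, kmer[prefix_len:], lo, hi)
    return pos < hi and suffixes[pos] == kmer[prefix_len:]


# Toy example with 4-mers and 2-character prefixes (hypothetical data).
index = build_index(["ACGT", "ACGA", "TTTT", "GATC"], prefix_len=2)
found = contains(index, "ACGT", prefix_len=2)
missing = contains(index, "ACGG", prefix_len=2)
```

Because k-mers sharing a prefix are stored contiguously, the prefix itself never needs to be written out per k-mer; only the offset table and the suffixes are kept.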

“…We measure the compressed space-usage (Table 2), compression time and memory (Table 3), and decompression time and memory. We compare against the following lossless compression strategies: 1) the binary output of the k-mer counters DSK [18], KMC [17], and Squeakr-exact [20]; 2) the original FASTA sequences, with headers removed; 3) the maximal unitigs; and 4) the BOSS representation [31] (as implemented in COSMO [42]). In all cases, the stored data is additionally compressed using MFC (for nucleotide sequences, i.e.…”

Section: Evaluation Of UST-Compress (mentioning)

confidence: 99%

Representation of k-mer sets using spectrum-preserving string sets

Rahman, Medvedev 2020

Preprint


Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
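The notion of a spectrum-preserving string set in the abstract above can be made concrete with a small Python sketch. This is not the UST algorithm: it is a simple greedy extension over k-mers that overlap by k-1 characters, it ignores reverse complements, it assumes k is at least 2, and the function names `greedy_spss` and `spectrum` are hypothetical. It only shows that a set of k-mers can be stored as fewer, longer strings whose k-mers are exactly the original set.

```python
def greedy_spss(kmer_set):
    """Greedily chain k-mers that overlap by k-1 characters into longer strings."""
    if not kmer_set:
        return []
    k = len(next(iter(kmer_set)))
    unused = set(kmer_set)
    strings = []
    for start in sorted(kmer_set):
        if start not in unused:
            continue
        unused.remove(start)
        path = start
        # Extend to the right while some unused k-mer continues the path.
        while True:
            tail = path[-(k - 1):]
            nxt = next((tail + c for c in "ACGT" if tail + c in unused), None)
            if nxt is None:
                break
            unused.remove(nxt)
            path += nxt[-1]
        # Extend to the left in the same way.
        while True:
            head = path[:k - 1]
            prv = next((c + head for c in "ACGT" if c + head in unused), None)
            if prv is None:
                break
            unused.remove(prv)
            path = prv[0] + path
        strings.append(path)
    return strings


def spectrum(strings, k):
    """Set of all k-mers contained in a collection of strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


# Toy k-mer set (hypothetical): four 3-mers become two strings.
kmer_set = {"ACG", "CGT", "GTA", "TTT"}
strings = greedy_spss(kmer_set)
```

Here the four 3-mers (12 characters) are stored as "ACGTA" and "TTT" (8 characters). Each k-mer is consumed exactly once, so the spectrum of the output equals the input set.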


KMC 3: counting and manipulating k-mer statistics

Abstract: Supplementary data are available at Bioinformatics online.
