Effective sequence similarity detection with strobemers

Sahlin, Kristoffer

doi:10.1101/gr.275648.121

Cited by 61 publications

(193 citation statements)

References 71 publications

Citation types: Supporting, Mentioning (193), Contrasting


“…The main idea of the seeding approach presented here is to first compute open syncmers (21) from the reference sequences, then link the syncmers together using the randstrobe method (22) with two strobes. The study introducing strobemers (22) described strobemers as linking together strobes in ‘sequence-space’, i.e., over the set of all k-mers. Since syncmers represent a subset of k-mers from the original sequence, computing randstrobes over this subset of strings is very fast; it suffices to compare a smaller set of syncmers to produce the next strobe, while still having a similar range on the upper and lower window bounds on the original sequence.…”

Section: Methods (mentioning)

confidence: 99%
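
The seeding idea in the statement above can be sketched in Python. This is a minimal illustration under stated assumptions, not the cited implementation: the CRC32 hash, the link function (h(s_1) + h(s_2)) mod q, and the names `open_syncmers` and `randstrobes` are all chosen for the example, and the syncmers are not made canonical here.

```python
import zlib


def h(s: str) -> int:
    """Deterministic stand-in hash (assumption: any well-mixed hash would do)."""
    return zlib.crc32(s.encode())


def open_syncmers(seq: str, k: int, s: int, t: int):
    """Return (position, k-mer) pairs whose smallest s-mer starts at offset t.

    t is a 0-based offset inside the k-mer and must satisfy 0 <= t <= k - s.
    """
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smer_hashes = [h(kmer[j:j + s]) for j in range(k - s + 1)]
        # The first minimum wins on ties.
        if smer_hashes.index(min(smer_hashes)) == t:
            out.append((i, kmer))
    return out


def randstrobes(syncmers, w_min: int, w_max: int, q: int = 2**31 - 1):
    """Link each syncmer to one syncmer located w_min..w_max syncmers downstream.

    The window is counted in syncmers, not in nucleotides, so only a few
    candidates are compared per seed.
    """
    seeds = []
    for i, (p1, s1) in enumerate(syncmers):
        window = syncmers[i + w_min:i + w_max + 1]
        if not window:
            break  # no downstream syncmers left to link to
        # Hypothetical link function: pick the candidate minimizing the sum mod q.
        p2, s2 = min(window, key=lambda c: (h(s1) + h(c[1])) % q)
        seeds.append((p1, p2, s1, s2))
    return seeds


# Usage on a made-up sequence with made-up parameters.
example = "ACGTTGCATGCCGATAGCTAGGCTAACGTAGCTAGCTAGGATCCGATCGATTAGC"
syncs = open_syncmers(example, k=8, s=3, t=2)
for p1, p2, s1, s2 in randstrobes(syncs, w_min=1, w_max=3):
    print(p1, p2, s1, s2)
```

Because the second strobe is selected among at most w_max - w_min + 1 syncmers, the per-seed work stays small even when the corresponding span on the original sequence is large.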

“…This means that s_1, s_2, and s′ are syncmers, and we will let [w_min, w_max] refer to the lower and upper number of syncmers downstream from s_1 where we will sample s_2 from. A second modification to the strobemers as described in (22) is that we store the strobemer hash value from two strobes s_1 and s_2 as H(s_1, s_2) = v(s_1)/2 + v(s_2)/2. The hash function H is symmetric (h(v(s_1), v(s_2)) = h(v(s_2), v(s_1))) and together with canonical syncmers it produces the same hash value if the strobemer is created from forward and reverse complement direction.…”

Section: Methods (mentioning)

confidence: 99%
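
The strand-invariance property described in the statement above can be checked with a short Python sketch. The assumptions are labeled: v(s) is taken to be a CRC32 hash of the canonical form of s (the lexicographically smaller of s and its reverse complement), division is integer division, and the function names and example strobes are hypothetical.

```python
import zlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return s.translate(_COMPLEMENT)[::-1]


def v(s: str) -> int:
    """Hash of the canonical form of s (assumed definition, see lead-in)."""
    return zlib.crc32(min(s, revcomp(s)).encode())


def strobemer_hash(s1: str, s2: str) -> int:
    """H(s1, s2) = v(s1)/2 + v(s2)/2, using integer division."""
    return v(s1) // 2 + v(s2) // 2


# Hypothetical strobes. Read from the opposite strand, the same seed appears
# as (revcomp(s2), revcomp(s1)): the strobes are complemented and swapped.
s1, s2 = "ACGTTGCC", "GATTACAG"
forward = strobemer_hash(s1, s2)
reverse = strobemer_hash(revcomp(s2), revcomp(s1))
print(forward == reverse)
```

Canonical values remove the dependence on strand for each strobe, and the symmetric sum removes the dependence on strobe order, so both directions yield the same value.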

“…The hash function H is symmetric (h(v(s_1), v(s_2)) = h(v(s_2), v(s_1))) and together with canonical syncmers it produces the same hash value if the strobemer is created from forward and reverse complement direction. It is stated in (22) that a symmetrical hash function is undesirable for mapping due to unnecessary hash collisions. However, when masking highly repetitive seeds as commonly performed in aligners (17), it turns out that a symmetrical hash function helps to avoid sub-optimal alignments, and we will now describe why.…”

Section: Methods (mentioning)

confidence: 99%

“…Assume we would use an asymmetric hash function, such as v(s_1)/2 + v(s_2)/3 proposed in (22). Also assume that strobemer seeds (s_1, s_2) and (s_2, s_1) are both found in forward orientation on the reference due to, e.g., inversions.…”

Section: Methods (mentioning)

confidence: 99%
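
The scenario in the statement above can be illustrated with small numbers. In this Python sketch the strobe values 600 and 900 are made up, division is integer division, and the occurrence counter is a hypothetical stand-in for the per-seed counts an aligner might use when masking repetitive seeds. It only shows how the two seed orders are bucketed under each hash; it does not reproduce the alignment argument, which the quoted text does not spell out.

```python
from collections import Counter


def hash_symmetric(v1: int, v2: int) -> int:
    """v1/2 + v2/2: swapping the strobes leaves the value unchanged."""
    return v1 // 2 + v2 // 2


def hash_asymmetric(v1: int, v2: int) -> int:
    """v1/2 + v2/3: the value depends on strobe order."""
    return v1 // 2 + v2 // 3


# Made-up strobe values; both seed orders occur once on the reference.
v1, v2 = 600, 900
occurrences = [(v1, v2), (v2, v1)]

sym_counts = Counter(hash_symmetric(a, b) for a, b in occurrences)
asym_counts = Counter(hash_asymmetric(a, b) for a, b in occurrences)

print(dict(sym_counts))   # {750: 2}
print(dict(asym_counts))  # {600: 1, 650: 1}
```

With the symmetric hash both orders share one bucket and are counted together; with the asymmetric hash they are counted separately, so an occurrence-based filter would treat the two orders independently.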

“…Here we show that syncmers and strobemers can be used in combination in what becomes a high-speed indexing method, roughly corresponding to the speed of computing minimizers. Our technique is based on first subsampling k-mers from the reference sequences by computing canonical open syncmers (21), then, producing strobemers (22) formed from linking together syncmers occurring close-by on the reference using the randstrobe method. A consequence is that instead of using a single seed (e.g., k = 21 as default in minimap2 for short-read mapping), we show that we can link together two syncmers as a strobemer seed and achieve similar accuracy to using individual minimizers.…”

Section: Introduction (mentioning)

confidence: 99%


Flexible seed size enables ultra-fast and accurate read alignment

Sahlin

2021

Preprint

Self Cite


Short-read genome alignment is a fundamental computational step used in many bioinformatic analyses. It is therefore desirable to align such data as fast as possible. Most alignment algorithms consider a seed-and-extend approach. Several popular programs perform the seeding step based on the Burrows-Wheeler Transform with a low memory footprint, but they are relatively slow compared to more recent approaches that use a minimizer-based seeding-and-chaining strategy. Recently, syncmers and strobemers were proposed for sequence comparison. Both protocols were designed for improved conservation of matches between sequences under mutations. Syncmers is a thinning protocol proposed as an alternative to minimizers, while strobemers is a linking protocol for gapped sequences and was proposed as an alternative to k-mers. The main contribution in this work is a new seeding approach that combines syncmers and strobemers. We use a strobemer protocol (randstrobes) to link together syncmers (i.e., in syncmer-space) instead of over the original sequence. Our protocol allows us to create longer seeds while preserving mapping accuracy. A longer seed length reduces the number of candidate regions which allows faster mapping and alignment. We also contribute the insight that speed-wise, this protocol is particularly effective when syncmers are canonical. Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing the non-canonical randstrobes in reverse complement. We implement our idea in a proof-of-concept short-read aligner strobealign that aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2. Many implementation versions of, e.g., BWA, achieve high speed on specific hardware. Our contribution is algorithmic and requires no hardware architecture or system-specific instructions. Strobealign is available at https://github.com/ksahlin/StrobeAlign.

