Seeding is usually the initial step of high-throughput sequence aligners. Two popular seeding strategies are fixed-size seeding (k-mers, minimizers) and variable-size seeding (MEMs, SMEMs, maximally spanning seeds). The former benefits from fast index construction and fast seed computation, while the latter benefits from high seed entropy. Here we build a performant bridge between the two strategies and show that neither is theoretically superior. We propose an algorithmic approach for computing MEMs from k-mers or minimizers. Further, we describe techniques for extracting SMEMs or maximally spanning seeds from MEMs. Comprehensive benchmarking shows the practical value of the proposed approaches. In this context, we report on the effects and fine-tuning of occurrence filters for the different seeding strategies.
KEYWORDS
high-throughput sequence alignment, minimizer, SMEM, FMD-Index, seed entropy.
INTRODUCTION
Most high-throughput read aligners [1-5] perform the following three steps: seeding [6, 7], seed processing (e.g. chaining, SoC) [8, 9] and dynamic programming [10, 11]. There are two techniques for seed computation: fixed-size seeding [12] and variable-size seeding [13, 14]. Fixed-size seeding is usually done via k-mers or via their space-efficient variant, minimizers [3]. Variable-size seeding, in turn, relies on some form of full-text search index such as the FMD-index [13, 14]. Fixed-size seeding benefits from short runtimes for index construction and seed computation, while variable-size seeding benefits from the high entropy of the generated seeds [2, 3]. Here, we present an efficient algorithmic bridge for computing variable-size seeds from fixed-size seeds. Hence, the performance advantages of fixed-size seeding become available for variable-size seeds as well.
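To make the idea of such a bridge concrete, the following is a minimal sketch for the simplest case only: exact k-mer seeds that are merged along diagonals into maximal exact matches of length at least k. It is an illustration under stated assumptions, not the algorithm proposed in this work. The function names (kmer_index, kmer_seeds, merge_to_mems) and the tuple layout (query position, reference position, length) are hypothetical, the index is a plain hash table, and minimizers, reverse strands and occurrence filters are not handled.

```python
from collections import defaultdict


def kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def kmer_seeds(index, query, k):
    """Return all fixed-size seeds as (query_pos, ref_pos) pairs."""
    seeds = []
    for q in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for r in index.get(query[q:q + k], ()):
            seeds.append((q, r))
    return seeds


def merge_to_mems(seeds, k):
    """Merge k-mer seeds on the same diagonal into maximal matches.

    Assumes `seeds` holds every k-mer match between query and reference.
    Under that assumption, consecutive seeds on one diagonal (ref_pos - query_pos)
    form exactly one maximal exact match of length >= k.
    Returns a list of (query_pos, ref_pos, length) tuples.
    """
    mems = []
    # Sort by diagonal first, then by query position within the diagonal.
    for q, r in sorted(seeds, key=lambda s: (s[1] - s[0], s[0])):
        if mems:
            mq, mr, ml = mems[-1]
            same_diagonal = (mr - mq) == (r - q)
            # The last merged seed starts at mq + ml - k; the next one must
            # start directly after it to extend the current match.
            if same_diagonal and q <= mq + ml - k + 1:
                mems[-1] = (mq, mr, q + k - mq)
                continue
        mems.append((q, r, k))
    return mems


if __name__ == "__main__":
    k = 3
    index = kmer_index("TTACGTACCA", k)
    seeds = kmer_seeds(index, "GACGTAG", k)
    print(merge_to_mems(seeds, k))
```

For the toy input above, the three overlapping 3-mer seeds ACG, CGT and GTA lie on one diagonal and are merged into the single match ACGTA, reported as (1, 2, 5). Matches shorter than k cannot be found this way, which is one reason the choice of k matters for fixed-size seeding.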