Abstract. In this manuscript, we present an optimized and parallel version of our previous work, IMSAME, an exhaustive gapped aligner for the accurate pairwise comparison of metagenomes. Parallelization strategies are applied to take advantage of modern multiprocessor architectures. In addition, sequential optimizations that reduce CPU time and memory consumption are provided. These algorithmic and computational enhancements enable IMSAME to calculate near-optimal alignments, which are used to directly assess similarity between metagenomes without requiring reference databases. We show that the overall efficiency of the parallel implementation exceeds 80% and that it remains scalable as the number of cores increases. Moreover, we show that the sequential optimizations yield up to an 8x speedup for scenarios with larger data.
Background

A metagenome is defined as a collection of genetic material directly recovered from the environment. In particular, a metagenome is composed of a large number of reads (DNA strings) drawn from the species present in the original population. Today, the field of comparative metagenomics has become big-data driven [1] due to technological improvements in high-throughput sequencing. However, the analysis of large metagenomic datasets represents a computational challenge and poses several processing bottlenecks, especially for sequence comparison algorithms.

Traditional metagenomic comparison involves intermediate pairwise (and individual) comparisons against a reference database. This procedure yields a mapping distribution between reads and species for each metagenome, so that the two distributions can later be compared. A similarity measure can then be computed from the two distributions. However, due to the unknown and complex composition