Mapping short reads against a reference genome is classically the first step of many next-generation sequencing data analyses, and it should be as accurate as possible. Because of the large number of reads to handle, numerous sophisticated algorithms have been developed in the last three years to tackle this problem. In this article, we first review the underlying algorithms used in most of the existing mapping tools, and then we compare the performance of nine of these tools on a well-controlled benchmark built for this purpose. We built a set of reads that exist in single or multiple copies in a reference genome with no mismatch, and a set of reads with three mismatches. As reference genomes, we considered both the human genome and a concatenation of all complete bacterial genomes. On each dataset, we quantified the capacity of the different tools to retrieve all the occurrences of the reads in the reference genome. Special attention was paid to reads reported uniquely and to reads with multiple hits.
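To make the benchmark idea concrete, here is a minimal sketch (our own toy code, not the paper's benchmark) of the underlying measurement: extract fixed-length reads from a reference and count the exact occurrences of each, distinguishing unique from multiple hits. The toy reference and function names are illustrative assumptions.

```python
# Toy illustration: count exact occurrences of each read in a reference,
# separating reads with a unique hit from reads with multiple hits.

def occurrences(reference: str, read: str) -> list[int]:
    """Return all start positions where `read` occurs exactly in `reference`."""
    positions, start = [], reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions

reference = "ACGTACGTGGACGTA"   # toy reference genome
read_len = 4
reads = {reference[i:i + read_len] for i in range(len(reference) - read_len + 1)}

for read in sorted(reads):
    hits = occurrences(reference, read)
    label = "unique" if len(hits) == 1 else "multiple"
    print(f"{read}: {len(hits)} hit(s) ({label}) at {hits}")
```

A real benchmark would of course run each mapping tool on such reads and compare the reported locations against this exhaustive ground truth.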
In this report, we prove that under a Markovian model of order one, the average depth of suffix trees of index $n$ is asymptotically similar to the average depth of tries (a.k.a. digital trees) built on $n$ independent strings. This leads to an asymptotic behavior of $(\log n)/h + C$ for the average depth of the suffix tree, where $h$ is the entropy of the Markov model and $C$ is a constant. Our proof compares the generating functions for the average depth in tries and in suffix trees; the difference between these generating functions is shown to be asymptotically small. We conclude by using the asymptotic behavior of the average depth in a trie under the Markov model found by Jacquet and Szpankowski ([JaSz91]).
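For concreteness, the stated asymptotics can be written out as below. The entropy expression is the standard entropy rate of a stationary order-one Markov chain; the notation $\pi$ and $p_{ij}$ is ours, not the paper's.

```latex
% Sketch of the stated result (notation \pi, p_{ij} is illustrative):
% the average depth D_n of a suffix tree over an index of size n satisfies
\[
  \mathbb{E}[D_n] = \frac{\log n}{h} + C + o(1),
  \qquad
  h = -\sum_{i} \pi_i \sum_{j} p_{ij} \log p_{ij},
\]
% where (p_{ij}) is the transition matrix of the order-one Markov source
% and \pi its stationary distribution.
```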
We consider a component of word statistics known as the clump: given a finite set of words, a clump is a maximal set of overlapping occurrences of these words in a text. This object was first studied by Schbath with the aim of counting the number of occurrences of words in random texts. Later work with a similar probabilistic approach used the Chen-Stein approximation for a compound Poisson distribution, where the number of clumps follows a law close to Poisson. Presently there is no combinatorial counterpart to this approach, and we fill this gap here. We also provide a construction for the yet unsolved problem of clumps of an arbitrary finite set of words. In contrast with the probabilistic approach, which only provides asymptotic results, the combinatorial method provides exact results that are useful when considering short sequences.
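The clump definition is easy to illustrate operationally. The following is a minimal sketch (our own toy code, not the paper's combinatorial construction): find every occurrence interval of the words, then merge intervals that strictly overlap, so each merged span is one clump.

```python
# Toy illustration of clumps: maximal sets of overlapping word occurrences.

def clumps(text: str, words: set[str]) -> list[tuple[int, int]]:
    """Return the (start, end) span of each clump of occurrences in `text`."""
    intervals = []
    for w in words:
        start = text.find(w)
        while start != -1:
            intervals.append((start, start + len(w)))
            start = text.find(w, start + 1)   # occurrences may overlap
    intervals.sort()
    merged = []
    for s, e in intervals:
        if merged and s < merged[-1][1]:      # overlaps the current clump
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))             # start a new clump
    return merged

text = "ATATATCGGATAT"
print(clumps(text, {"ATA", "TAT"}))  # two clumps: [(0, 6), (9, 13)]
```

Note that adjacent but non-overlapping occurrences start a new clump, matching the requirement that occurrences within a clump overlap.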