Abstract

Exact pattern matching aims to locate all occurrences of a pattern in a text. Many algorithms have been proposed, but two of them, Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM), are the most widespread. Exact matching is the basis of some approximate string matching algorithms such as BLAST, and in many cases it is desirable to locate exact rather than approximate matches. Although several studies included measurements on small alphabets, none of them designed an algorithm specifically for nucleotide sequences. In addition, no application programming interface is available for pattern matching in nucleotide sequences. This study aimed to resolve both issues. A portion of the Chlamydomonas reinhardtii genome (30 megabases) was searched with queries ranging from 10 to 2000 nucleotides and with the number of matches varying between one and 25000. The results indicate that two of the algorithms developed in this study are sufficient to efficiently cover the complete search space of the experiment conducted here. Thus the aim of implementing an algorithm specifically targeting pattern matching in nucleotide sequences and making it available to the general public as an application programming interface was achieved. All algorithms are freely available at:
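As an illustration of what "locating all occurrences of a pattern in a text" means for nucleotide sequences, the following is a minimal sketch of the classic Knuth-Morris-Pratt search in Python. It is not one of the algorithms developed in this study, and the function names `failure_table` and `find_all` are hypothetical, chosen only for this example.

```python
from typing import List


def failure_table(pattern: str) -> List[int]:
    """For each prefix of the pattern, length of its longest proper
    prefix that is also a suffix (the KMP failure function)."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def find_all(text: str, pattern: str) -> List[int]:
    """Return the 0-based start positions of all (possibly overlapping)
    exact occurrences of pattern in text."""
    if not pattern:
        return []
    table = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]  # keep going to catch overlapping matches
    return hits


if __name__ == "__main__":
    # Toy sequence, invented for illustration only.
    print(find_all("ACGTACGTAC", "ACGT"))
```

The search never moves backwards in the text, so its running time is linear in the combined length of text and pattern, regardless of how many matches occur.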