Abstract. Algebraic Dynamic Programming is a powerful method for designing and implementing versatile pattern matching algorithms on sequences; here we consider mixed sequence and secondary structure motifs in RNA. A recurring challenge when designing new pattern matchers is to provide a statistical analysis of pattern significance. We demonstrate that, by the use of so-called canonical pattern descriptions, the expected number of hits on a sequence of length n can be computed a priori, using the pattern matcher itself. This provides a systematic way to calibrate the specificity of pattern matching algorithms. The technique is illustrated with IRE and SECIS elements.
Motivation and Overview
Evaluation of Pattern Significance
Significance of standard patterns. A mathematical evaluation of the significance of hits is essential in all pattern matching applications. For simple sequence patterns, the approach of [2], implemented in the tool Verbumculus, provides a complete analysis of over- and underrepresented words in a sequence. The E-values computed by the BLAST programs [1] for sequence similarity search indicate how unlikely it is to find a given match by chance. Similarly, the REPuter tool [24] provides significance scores in the analysis of approximate repeats and palindromes in genome data.
Significance analysis of general patterns. Sequence subwords, local similarity and repeats are rather elementary patterns. In searching for, say, regulatory elements in RNA, we need all of the above, and more: we design patterns that adhere to a structural shape and may carry some well-defined sequence motifs, but otherwise allow considerable variation in both sequence and structure. When designing such a motif description, a central problem is to balance specificity against variation, or in statistical terms: to analyse E_m(n, c), the expected number of hits of the motif m on a random sequence of length n and base composition c.
Note that E_m(n, c) is a property of the motif, in contrast to a BLAST E-value, which rates a particular motif instance. Although an analysis of