We present nrgrep ('non-deterministic reverse grep'), a new pattern-matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bit-parallel simulation of a non-deterministic suffix automaton. As a result, nrgrep can search for anything from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept that is fully integrated into nrgrep and that contributes to this smoothness is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string-matching tools for the simplest patterns, and that it is unmatched by a wide margin for more complex patterns.
1266 G. NAVARRO

unwilling to maintain an index for that purpose), dynamic text collections (where the cost of keeping an up-to-date index is prohibitive, including the searchers inside text editors and Web interfaces‡), for not very large texts (up to a few hundred megabytes) and even as internal tools of indexed schemes.

There is a large class of string-matching algorithms in the literature (see, for example, [5-7]) but not all of them are practical. There is also a wide variety of fast online string-matching tools in the public domain, most prominently the grep family. Among these, Gnu grep and Wu and Manber's agrep [1] are widely known and currently considered to be the fastest string-matching tools in practice. Another distinguishing feature of these software systems is their flexibility: they can search not only for simple strings, but they also permit classes of characters (that is, a pattern position matches a set of characters), wild cards (a pattern position that matches an arbitrary string), regular expression searching, multipattern searching, etc. Agrep also permits approximate searching: the pattern matches the text after performing a limited number of alterations on it.

The algorithmic principles behind agrep are diverse [8]. Exact string matching is done with the Horspool algorithm [9], a variant of the Boyer-Moore family [10]. The speed of the Boyer-Moore string-matching algorithms comes from their ability to 'skip' (i.e. not inspect) some text characters. Agrep deals with more complex patterns using a variant of Shift-Or [11], an algorithm exploiting 'bit parallelism' (a concept that we explain later) to simulate non-deterministic automata (NFA) efficiently. Shift-Or, however, cannot skip text characters. Multipattern searching is treated with bit parallelism or with a different algorithm depending on the case.
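To make the contrast between the two principles concrete, the following is a minimal Python sketch of textbook Horspool and textbook Shift-Or. It is not agrep's or nrgrep's actual code, and the function names `horspool_search` and `shift_or_search` are hypothetical. As an assumption made for illustration, the Shift-Or routine accepts either a plain string or a list of strings in which each element lists the characters allowed at that pattern position, which is one simple way to represent classes of characters.

```python
def horspool_search(pattern: str, text: str) -> list[int]:
    """Start positions of all occurrences of pattern in text (textbook Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character of pattern[:-1], the distance from its last
    # occurrence to the end of the pattern; any other character shifts by m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character under the last pattern
        # position; text characters may be skipped without being inspected.
        pos += shift.get(text[pos + m - 1], m)
    return hits


def shift_or_search(pattern, text: str) -> list[int]:
    """Start positions of all occurrences (textbook Shift-Or).

    pattern is a string, or a list of strings where element i holds the
    characters accepted at position i (a class of characters).
    """
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when position i accepts c.
    masks: dict[str, int] = {}
    for i, char_class in enumerate(pattern):
        for c in char_class:
            masks[c] = masks.get(c, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[0..i] matches the text ending here.
    state = all_ones
    hits = []
    for j, c in enumerate(text):
        # One shift and one OR update every NFA state at once; every text
        # character is read, so nothing is skipped.
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(j - m + 1)
    return hits
```

Under these assumptions, `horspool_search("algorithm", text)` handles only the plain string, whereas `shift_or_search(["Aa", *"lgorithm"], text)` handles the class "[Aa]" at no extra cost per text character, but inspects every character of the text.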
As a result, the search performance of agrep varies sharply depending on the type of search pattern, and even slight modifications to the pattern yield widely different search times. For example, the search for the string "algorithm" is seven times faster than for "[Aa]lgorithm" (where "[Aa]" is a ...