Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction. We demonstrate that, compared to their non-deterministic superclass, these deterministic regular expressions with back-references have desirable algorithmic properties (i.e., efficiently solvable membership problem and some decidable problems in static analysis), while, at the same time, their expressive power exceeds that of deterministic regular expressions without back-references.
Introduction

Regular expressions were introduced in 1956 by Kleene [34] and quickly found wide use in both theoretical and applied computer science, including applications in bioinformatics [41], programming languages [49], model checking [48], and XML schema languages [47]. While the theoretical interpretation of regular expressions remains mostly unchanged (as expressions that describe exactly the class of regular languages), modern applications use variants that differ greatly in expressive power and algorithmic properties. This paper tries to find common ground between two of these variants with opposing approaches to the balance between expressive power and tractability.
REGEX

The first variant that we consider is regex: regular expressions that are extended with a back-reference operator. This operator is supported in almost all modern programming languages (e.g., Java, Perl, and .NET). For example, the regex x : (a ∨ b)* · &x defines {ww | w ∈ {a, b}*}, as (a ∨ b)* can create a w ∈ {a, b}*, which is then stored in the variable x and repeated with the reference &x. Hence, back-references make it possible to define non-regular languages, but with the side effect that the membership problem is NP-complete (cf. Aho [2]).

Regex were first examined from a theoretical point of view by Aho [2], but without fully defining the semantics. There were various proposals for semantics, of which we mention the first by Câmpeanu, Salomaa, and Yu [10], and the recent one by Schmid [46], which is the basis for this paper. Apart from defining the semantics, there was work on the expressive power [10,11,25], the static analysis [11,23,24], and the tractability of the membership problem (investigated in terms of a strongly restricted subclass of regex) [21,22]. Regex have also been compared to related models in database theory, e.g., graph databases [4,26] and information extraction [20,24].

* This work represents an extended version of the paper "Deterministic Regular Expressions with Back-References" presented at STACS 2017 and published in LIPIcs (http://dx.doi.org/10.4230/LIPIcs.STACS.2017.33).

3. The intersection-emptiness problem for DRX is undecidable, but in PSPACE for variable-star-free...
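
As an informal illustration of the back-reference operator from the REGEX paragraph, the following minimal sketch tests membership in {ww | w ∈ {a, b}*} using Python's `re` module. It rests on assumptions that are not part of the paper: capture group 1 stands in for the variable x, `\1` stands in for the reference &x, and the helper name `in_copy_language` is hypothetical. Python's matching semantics are not the formal semantics of Schmid [46], so this is only an approximation of the example, not the paper's definition.

```python
import re

# Python spelling of the copy-language example (an assumption, not the
# paper's notation): group 1 plays the role of the variable x, and \1
# plays the role of the reference &x.
COPY = re.compile(r"((?:a|b)*)\1")


def in_copy_language(word: str) -> bool:
    """Return True if word has the form ww for some w over {a, b}."""
    # fullmatch requires the whole input to be consumed, so the engine
    # backtracks over all candidate split points for w.
    return COPY.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ["", "aa", "abab", "aba", "abba"]:
        print(repr(word), in_copy_language(word))
```

The words "", "aa", and "abab" are accepted, while "aba" and "abba" are rejected. The language is not regular, so a regular expression without the back-reference could not express this test.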