This thesis offers two logic-based approaches to conjunctive queries in the context of information extraction. The first and main approach is the introduction of conjunctive query fragments of the logics FC and FC[REG], denoted as FC-CQ and FC[REG]-CQ respectively. FC is a first-order logic based on word equations, where the semantics are defined by limiting the universe to the factors of some finite input word. FC[REG] is FC extended with regular constraints.

Our first results consider the comparative expressive power of FC[REG]-CQ in relation to document spanners (a formal framework for the query language AQL), and various fragments of FC[REG]-CQ, some of which coincide with well-known language generators, such as patterns and regular expressions.

better supervisor and mentor. I'd also like to thank the numerous people at Loughborough University who have helped me during my time here, both as an undergraduate student and a PhD student. I am grateful to Daniel Reidenbach for being my second supervisor, as well as for sparking my interest in theoretical computer science with his undergraduate module. Thank you to Andrea Soltoggio and Szymon Łopaciuk for all the interesting discussions over lunch. An additional thanks to Szymon for numerous conversations in the office about research and more. I wish to express my gratitude to Parisa Derakhshan, Iain Phillips, Robert Mercaş, Manfred Kufleitner, Dominik D. Freydenberger, and Yanning Yang for allowing me to teach their modules. Doing so has made me a more well-rounded computer scientist.

I am grateful to Joel D. Day and Ana Sălăgean for agreeing to be my internal examiners, and Anthony W. Lin for being my external examiner.

I would also like to thank Justin Brackemann for the helpful feedback and list of typos in the proof of Theorem 4.12.

Last, but not least, I'd like to thank my family. Without all their encouragement, I would not have been able to finish this project. I can't put into words how thankful I am for my parents, my sister, and my grandparents.
Document Spanners. Document spanners were introduced by Fagin, Kimelfeld, Reiss, and Vansummeren [31] as a formal framework for the Annotation Query Language (or AQL) used in IBM's SystemT. Regarding data complexity, Florenzano, Riveros, Ugarte, Vansummeren, and Vrgoč [34] gave a constant-delay algorithm for enumerating the results of deterministic vset-automata, after linear-time preprocessing. Amarilli, Bourhis, Mengel, and Niewerth [3] extended this result to nondeterministic vset-automata.

Regarding combined complexity, Freydenberger, Kimelfeld, and Peterfreund [39] introduced regex CQs and proved that their evaluation is NP-complete (even for acyclic queries), and that fixing the number of atoms and the number of equalities in SERCQs allows for polynomial-delay enumeration of results. Freydenberger, Peterfreund, Kimelfeld, and Kröll [89] showed that non-emptiness for a join of two sequential regex formulas is NP-hard, under schemaless semantics, even for a single c...