DEFINITION
Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked-up text, for instance as XML, where the XML elements define the structure on the text content.
Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language [4]. The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called an element, region, or segment, which is defined on top of the text model's word tokens. The query language typically defines a number of operators on content and structure, such as set operators and operators like "containing" and "contained-by", to model relations between content and structure, as well as relations between the structural elements themselves.
Using such a query language, the (expert) user can, for instance, formulate requests like "I want a paragraph discussing formal models near a table discussing the differences between databases and information retrieval". Here, "formal models" and "differences between databases and information retrieval" should match the content that needs to be retrieved from the database, whereas "paragraph" and "table" refer to structural constraints on the units to retrieve. The features, structuring power, and expressiveness of the query languages of several models for structured text retrieval are discussed below.
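To make the three parts concrete, the following is a minimal sketch in Python. It assumes a toy text model (lowercased, whitespace-separated tokens), a structure model in which each region is a pair of token positions, and two query operators, "containing" and "contained-by". The names Region, tokenize, match_phrase, containing, and contained_by are hypothetical and chosen for illustration only; they do not reproduce the definitions of any particular model discussed in this entry.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A contiguous span of tokens (assumed structure model)."""
    start: int  # position of the first token, inclusive
    end: int    # position of the last token, inclusive


def tokenize(text: str) -> list[str]:
    # Toy text model: lowercase and split on whitespace.
    # No stop words, stemming, or synonyms are applied here.
    return text.lower().split()


def match_phrase(tokens: list[str], phrase: str) -> set[Region]:
    # Content query: every occurrence of the phrase, returned as a region.
    words = tokenize(phrase)
    n = len(words)
    return {
        Region(i, i + n - 1)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == words
    }


def _encloses(outer: Region, inner: Region) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def containing(outer: set[Region], inner: set[Region]) -> set[Region]:
    # Regions of `outer` that enclose at least one region of `inner`.
    return {a for a in outer if any(_encloses(a, b) for b in inner)}


def contained_by(inner: set[Region], outer: set[Region]) -> set[Region]:
    # Regions of `inner` that lie inside at least one region of `outer`.
    return {b for b in inner if any(_encloses(a, b) for a in outer)}


# Hypothetical eight-token document with one paragraph and one table.
tokens = tokenize("Formal models are useful databases differ from retrieval")
paragraphs = {Region(0, 3)}
tables = {Region(4, 7)}

hits = match_phrase(tokens, "formal models")
paragraphs_about_models = containing(paragraphs, hits)
tables_about_models = containing(tables, hits)
hits_in_paragraphs = contained_by(hits, paragraphs)
```

Because every operator takes sets of regions and returns a set of regions, results can be combined further with ordinary set union, intersection, and difference, which is one way to read the "set operators" mentioned above.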
HISTORICAL BACKGROUND
The STAIRS system (Storage and Information Retrieval System), which was developed at IBM as early as the late 1950s, allowed querying both content and structure. Much like today's On-line Public Access Catalogues, it was used to store bibliographic data in records with fields such as keywords and title, providing structured search, but neither overlapping or hierarchical structures nor full-text search. At the end of the 1980s, researchers at the University of Waterloo in Canada investigated database support for the creation of an electronic version of the Oxford English Dictionary. This resulted in a number of models for querying and manipulating content and hierarchical structure, such as the parsed strings model [10], PAT expressions [15], the containment model [5], and the generalized concordance lists model [7]. Similar approaches were developed elsewhere, such as the proximal nodes model [13] and the nested region model [11]. The interest in structured text retrieval models has grown since the introduction of XML in 1998 and the emergence of standard data retrieval query languages (see XPATH/XQUERY) for XML data. One might argue that the structured text retrieval approaches such...