XML holds the promise to yield (1) a more precise search by providing additional information in the elements, (2) a better integrated search of documents from heterogeneous sources, (3) a powerful search paradigm using structural as well as content specifications, and (4) data and information exchange to share resources and to support cooperative search. We survey several indexing techniques for XML documents, grouping them into flatfile, semistructured, and structured indexing paradigms. Searching techniques and supporting techniques for searching are reviewed, including full text search and multistage search. Because searching XML documents can be very flexible, various search result presentations are discussed, as well as database and information retrieval system integration and XML query languages. We also survey various retrieval models, examining how they would be used or extended for retrieving XML documents. To conclude the article, we discuss various open issues that XML poses with respect to information retrieval and database research.
Introduction
An Internet search engine (e.g., Altavista or Infoseek) returns thousands of so-called matched documents for a single query, some relevant to the query and others not. End users typically have trouble organizing and digesting such vast quantities of information, much of which (75%, as pointed out by Selberg and Etzioni, 1997) is likely to be irrelevant. XML holds the promise that searching can be done more precisely because structural, self-describing information and meta-data (e.g., RDF) are available, allowing context-based and/or category-based search. XML also holds the promise to model heterogeneous data, generated from databases (DBs) or from word processors, thereby enabling search engines to locate and process heterogeneous documents or records.
An XML document consists of a set of hierarchically structured elements, as defined by the user. Each element has a user-defined name (e.g., p for paragraph). Data of an element (say, p) can be stored inside the element, delimited by its start tag (i.e., <p>) and its end tag (i.e., </p>), or it can be stored as values of its attributes (e.g., <p id="1">). Certain attribute value types are reserved for referencing (e.g., IDREF). An XML element is typically accessed using the XPath language, in which a parent element and its child elements are separated by a slash. For example, the XPath /header/author/first accesses the element first by starting at the root element header and then passing through the author element.
It is possible to use other mark-up languages (e.g., HTML) or proprietary formats, but XML appears to be suitable for a wide variety of information retrieval (IR) tasks, specific enough to reduce modeling complexity and open enough for easy and rapid adoption. A major advantage of XML over HTML is that users can define their own tags. 
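The element, attribute, and path notions described above can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath; the sample document, its names, and its values are hypothetical and chosen only to mirror the /header/author/first example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document (illustration only); the element names follow
# the /header/author/first example in the text.
DOC = """<header>
  <author>
    <first>Jane</first>
    <last>Doe</last>
  </author>
  <p id="1">Data stored between the start tag and the end tag.</p>
</header>"""

root = ET.fromstring(DOC)  # root is the <header> element

# ElementTree evaluates paths relative to the element they are called on,
# so the absolute XPath /header/author/first becomes author/first here.
first = root.find("author/first")
first_name = first.text

# Data can also be stored as an attribute value, as in <p id="1">.
para = root.find("p")
para_id = para.get("id")

print(first_name, para_id)  # prints: Jane 1
```

A full XPath processor would accept the absolute path /header/author/first directly; the relative form is used here only because of the subset that this particular module implements.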
Tag names are typically chosen to incorporate some relationship to the semantics of the contents or the type of co...