The IR community has devoted considerable effort to extending free-text
query processing toward semi-structured XML search. Most methods rely on the
concept of the Lowest Common Ancestor (LCA) of two or more structural
nodes to identify the most specific XML elements containing the query keywords
posed by the user. Yet, few of the existing approaches consider XML
semantics, and the methods that process semantics generally rely on
computationally expensive word sense disambiguation (WSD) techniques, or
apply semantic analysis at a single stage only (query
relaxation/refinement over the bag-of-words retrieval model) to reduce
processing time. In this paper, we describe a new approach for XML keyword
search that addresses these limitations. Our solution first
transforms the XML document collection (offline) and the keyword query
(on-the-fly) into meaningful semantic representations using context-based
and global disambiguation methods, specifically designed to run in almost
linear time. A semantic-aware inverted index then supports semantic-aware
search, result selection, and result ranking.
The semantically augmented XML data tree is processed for structural node
clustering, based on semantic query concepts (i.e., key-concepts), in order
to identify and rank candidate answer sub-trees containing related
occurrences of query key-concepts. We present dedicated weighting functions
and search algorithms developed for that purpose.
Experimental results highlight the quality and potential of our
approach.
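
To make the LCA notion mentioned above concrete, the following is a minimal sketch of how the lowest common ancestor of keyword-matching nodes can be found in a small XML tree. It illustrates only the generic LCA idea that most existing methods build on, not the semantic approach proposed in this paper. The sample document and the helper names `keyword_nodes` and `lowest_common_ancestor` are hypothetical and chosen for illustration, and keyword matching is reduced to a plain substring test.

```python
import xml.etree.ElementTree as ET


def lowest_common_ancestor(root, nodes):
    """Return the deepest element that is an ancestor-or-self of every node."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}

    def path_from_root(node):
        path = [node]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path[::-1]

    paths = [path_from_root(n) for n in nodes]
    lca = None
    # Walk the root-to-node paths in lockstep until they diverge.
    for level in zip(*paths):
        if all(e is level[0] for e in level):
            lca = level[0]
        else:
            break
    return lca


def keyword_nodes(root, keyword):
    """Hypothetical matcher: elements whose own text contains the keyword."""
    kw = keyword.lower()
    return [e for e in root.iter() if e.text and kw in e.text.lower()]


# Illustrative document, not taken from the paper's test collection.
DOC = """
<library>
  <book>
    <title>Semantic Search</title>
    <author>Smith</author>
  </book>
  <book>
    <title>XML Indexing</title>
    <author>Jones</author>
  </book>
</library>
"""

root = ET.fromstring(DOC)
title = keyword_nodes(root, "semantic")[0]
author_same = keyword_nodes(root, "smith")[0]
author_other = keyword_nodes(root, "jones")[0]

# Keywords inside the same book meet at that <book> element.
same_book = lowest_common_ancestor(root, [title, author_same])
# Keywords from different books only meet at the <library> root.
cross_book = lowest_common_ancestor(root, [title, author_other])
print(same_book.tag, cross_book.tag)
```

In this sketch, the more specific the LCA (a `book` rather than the `library` root), the more closely related the keyword occurrences are assumed to be, which is the purely structural intuition that LCA-based methods rely on.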