Summary. XML was created to represent, exchange, and publish information on the Web, but it has since spread to many other applications. Due to this success, the W3C has proposed a new query language, XQuery, specifically designed to query XML data. XQuery returns exact answers to queries; however, when applied to large XML repositories or warehouses, such precise queries may incur long response times. Our research proposes a methodology for the semi-automatic derivation of summarized documents (synopses) for massive, heterogeneous XML data sets, with the final aim of producing rules that transform queries on the original data sets into queries on the summarized data set.
Introduction and Motivation

In the last few years, XML has spread to many application fields, and it is now used as a format for exchanging data on the Web and for ensuring interoperability among applications. Due to this success, the W3C has proposed a new query language, XQuery [W3C04], specifically designed to query XML data. XQuery is a well-defined but rather complex language [HPG04]. In this work we propose a new approach to overcome the high computational cost of aggregate queries over massive XML data collections. In traditional relational warehouses [GPA+98], a similar problem is solved by means of fast approximate queries that use concise data statistics based on histograms or other statistical techniques. Their most common application is to aggregate queries in modern decision support systems, where large volumes of data need to be queried and quick, interactive responses from the DBMS are required, for example to analyze the data in the warehouse for trend information used in evaluating marketing strategies. In such applications, users are often more interested in obtaining an approximate answer computed in a short time than an exact one obtained after minutes or, at worst, hours.