Abstract. As XML is increasingly being used in Web applications, new technologies need to be investigated for processing XML documents with high performance. Parallelism is a promising solution for structured document processing and data placement is a major factor for system performance improvement in parallel processing. This paper describes an effective XML document data placement strategy. The new strategy is based on a multilevel graph partitioning algorithm with the consideration of the unique features of XML documents and query distributions. A new algorithm, which is based on XML query schemas to derive the weighted graph from the labelled directed graph presentation of XML documents, is also proposed. Performance analysis on the algorithm presented in the paper shows that the new data placement strategy exhibits low workload skew and a high degree of parallelism.Keywords: Data Placement, XML Documents, Graph Partitioning, and Parallel Data Processing.
IntroductionAs a new markup language for structured documentation, XML (eXtensible Markup Language) is increasingly being used in Web applications because of its unique features in data representation and exchange. The main advantage of XML is that each XML file can have a semantic schema and makes it possible to define much more meaningful queries than simple, keyword-based retrievals. A recent survey shows that the number of XML business vocabularies has increased from 124 to over 250 in six months [1]. It can be expected that data in XML format would be largely available throughout the Web in the near future. As Web applications are time vulnerable, the increasing size of XML documents and the complexity of evaluating XML queries pose new performance challenges to existing information retrieval technologies. The use of parallelism has shown good scalability in traditional database applications and provides an attractive solution to process structured documents [2]. A large number of XML documents can be distributed onto several processing nodes so that a reasonable query response time can be achieved by processing the related data in parallel.
As Web applications are time vulnerable, the increasing size of XML documents and the complexity of evaluating XML queries pose new performance challenges to existing information retrieval technologies. This paper introduces a new approach for developing a purpose-built XML data management system to improve the system performance of a Web site with XML support by using parallel data processing techniques. To improve the system performance, we proposed a parallelisation model for XML data processing, where the data storage strategies, data placement methods and query evaluation techniques have been studied. Other related issues are also presented.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.