Validity: Until the end of winter semester 2016/17
Instructions
The SparkSQL framework enables distributed and parallel processing of data in various formats using an SQL-like query language. The main goal of the master thesis is to use the SparkSQL framework to implement a subset of expressions from the XPath query language, which is used for querying XML data.
1. Get acquainted with the Apache Spark engine, focusing mainly on its SparkSQL framework.
2. Study the works related to mapping XML database technology (XML documents) to relational database technology.
3. Based on your knowledge, design a query engine that is able to evaluate XPath queries over XML documents.
4. Implement a prototype of the designed solution using the SparkSQL framework.
5. Perform suitable testing of the implemented prototype, aiming primarily at its functional properties.
6. Summarize the performed testing and assess the possibility of deploying the prototype in a highly distributed environment.
References
Will be provided by the supervisor.
Declaration
I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis.
I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular that the Czech Technical University in Prague has the right to conclude a license agreement on the utilization of this thesis as school work under the provisions of Article 60 (1)
Citation of this thesis
Abstract
The main goal of this thesis is to use the Spark SQL framework to implement a subset of expressions from the XPath query language. The first part of the thesis introduces the Apache Spark project. The second part analyzes the mapping of XML documents into tabular form using a node encoding that preserves document order, and describes an approach to the solution that uses Spark's features. The third part focuses on the implementation and testing of the designed solution.
Keywords: XML, XPath, SQL, Spark, Spark SQL, DataFrame, Dewey order encoding
Abstrakt
The goal of this thesis is to implement a subset of expressions of the XPath language using the Spark SQL system. The first part of the thesis introduces the Apache Spark project. The second part covers an analysis of the options for mapping XML documents into tabular form using a node encoding that preserves the order of nodes within the document. The second part also describes several approaches to the solution that use the features of Spark. The third part of the thesis focuses on the implementation and testing of the designed solution.
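The idea of mapping an XML document into tabular form with an order-preserving node encoding can be illustrated with a minimal sketch. It is written in plain Python with the standard library rather than Spark, and it is not the schema used in the thesis: the function names `dewey_rows` and `is_descendant`, the row layout `(label, tag, text)`, and the sample document are assumptions made for illustration only. Each element receives a Dewey label, the path of sibling positions from the root, so sorting labels component by component reproduces document order and a prefix test identifies descendants.

```python
import xml.etree.ElementTree as ET


def dewey_rows(xml_text):
    """Flatten an XML document into (dewey_label, tag, text) rows.

    The label is a tuple of 1-based sibling positions from the root,
    e.g. (1, 2, 1) is the first child of the second child of the root.
    Only element nodes are encoded in this sketch; attributes and
    mixed-content text nodes are ignored (a simplifying assumption).
    """
    root = ET.fromstring(xml_text)
    rows = []

    def visit(elem, label):
        text = (elem.text or "").strip()
        rows.append((label, elem.tag, text))
        # Children extend the parent's label with their sibling position.
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return rows


def is_descendant(candidate, ancestor):
    """True if `candidate` lies strictly below `ancestor` (prefix test)."""
    return (len(candidate) > len(ancestor)
            and candidate[:len(ancestor)] == ancestor)


# Hypothetical sample document, used only for this illustration.
SAMPLE = (
    "<library>"
    "<book><title>A</title><year>2001</year></book>"
    "<book><title>B</title></book>"
    "</library>"
)

rows = dewey_rows(SAMPLE)
for label, tag, text in rows:
    print(".".join(str(part) for part in label), tag, text)
```

In a table of such rows, a descendant step like `//title` becomes a filter on the tag column, and an axis step relative to a context node becomes a comparison on the label column. How the thesis expresses these steps in Spark SQL is described in its later parts, not here.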