In this paper, we investigate how similar good news and bad news are in the context of long-term market trends, and we discuss the relation between information retrieval and text mining. We analyzed about 400 thousand news stories from the years 1999 to 2002. We argue that the classification methods of information retrieval are not strong enough to solve problems like this one, because the meaning of news is given not only by the words used and their frequency but also by the structure of sentences and their context. We present the results of our experiments and examples of news that support this statement.
Validity: Until the end of winter semester 2016/17
Instructions
The SparkSQL framework enables distributed and parallel processing of data in various formats using an SQL-like query language. The main goal of the master thesis is to use the SparkSQL framework to implement a subset of expressions from the XPath query language, which is used for querying XML data.
1. Get acquainted with the Apache Spark engine, focusing mainly on its SparkSQL framework.
2. Study the works related to mapping XML database technology (XML documents) to relational database technology.
3. Based on your knowledge, design a query engine that will be able to evaluate XPath queries over XML documents.
4. Implement a prototype of the designed solution using the SparkSQL framework.
5. Perform suitable testing of the implemented prototype, focusing primarily on its functional properties.
6. Create a summary of the performed testing and assess the possibility of its deployment in a highly distributed environment.
References
Will be provided by the supervisor.
Declaration
I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis.
I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular that the Czech Technical University in Prague has the right to conclude a license agreement on the utilization of this thesis as school work under the provisions of Article 60 (1)
Citation of this thesis
Abstract
The main goal of this thesis is to use the Spark SQL framework to implement a subset of expressions from the XPath query language. The first part of the thesis introduces the Apache Spark project. The second part analyzes the mapping of XML documents into tabular form using a node encoding that preserves document order, and it describes an approach to the solution that uses Spark's features. The third part focuses on the implementation and testing of the designed solution.
Keywords: XML, XPath, SQL, Spark, Spark SQL, DataFrame, Dewey order encoding
Abstract (translated from Slovak)
The goal of this thesis is to implement a subset of expressions of the XPath language using the Spark SQL system. The first part of the thesis introduces the Apache Spark project. The second part covers an analysis of the options for mapping XML documents into tabular form using an encoding of nodes that preserves their order within the document. The second part also describes several approaches to the solution that make use of the features of the Spark system. The third part of the thesis focuses on the implementation and testing of the designed solution.
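The Dewey order encoding named in the keywords can be illustrated with a small sketch. This is not the thesis's implementation and it does not use Spark: it is plain Python under the assumption that each XML node becomes one row labelled with the path of sibling positions from the root, so that sorting by label reproduces document order. The function names, the row layout `(label, tag, text)`, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET


def dewey_rows(xml_text):
    """Flatten an XML document into (label, tag, text) rows in document order.

    The label is a tuple of 1-based sibling positions from the root,
    e.g. (1, 2, 1) is the first child of the second child of the root.
    """
    root = ET.fromstring(xml_text)
    rows = []

    def visit(node, label):
        rows.append((label, node.tag, (node.text or "").strip()))
        # Children extend the parent's label with their sibling position.
        for position, child in enumerate(node, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return rows


def children(rows, label):
    """Child axis: rows exactly one level below `label`."""
    n = len(label)
    return [r for r in rows if len(r[0]) == n + 1 and r[0][:n] == label]


def descendants(rows, label):
    """Descendant axis: rows whose label has `label` as a strict prefix."""
    n = len(label)
    return [r for r in rows if len(r[0]) > n and r[0][:n] == label]


# Hypothetical sample document, used only for illustration.
SAMPLE = (
    "<library>"
    "<book><title>A</title><author>X</author></book>"
    "<book><title>B</title></book>"
    "</library>"
)

if __name__ == "__main__":
    for row in dewey_rows(SAMPLE):
        print(row)
```

In a tabular setting, the same prefix tests would become filter conditions on a label column, which is why an order-preserving encoding is convenient for evaluating XPath axes over rows.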