This tutorial gives an overview of state-of-the-art methods for the automatic construction of large knowledge bases and for harnessing them in data and text analytics. It covers both big-data methods for building knowledge bases and the role of knowledge bases as assets for big-data applications. The tutorial also points out challenges and research opportunities.
MOTIVATION AND SCOPE

Comprehensive machine-readable knowledge bases (KBs) have been pursued since the seminal projects Cyc [19,20] and WordNet [12]. In contrast to these manually created KBs, great advances have recently been made on automating the building and curation of large KBs [1,16]. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications. Particular emphasis is on the twofold role of KBs for big-data analytics: using scalable distributed algorithms for harvesting knowledge from Web and text sources, and leveraging entity-centric knowledge for deeper interpretation of and better intelligence with big data.
BUILDING KNOWLEDGE BASES

Digital Knowledge: Today's KBs represent their data mostly in RDF-style SPO (subject-predicate-object) triples. We introduce this data model and the most salient KB projects, which include KnowItAll [10,11].

Harvesting Knowledge on Entities and Classes: Every entity in a KB (e.g., Steve Jobs) belongs to one or multiple classes (e.g., computer pioneer, entrepreneur). These classes are organized into a taxonomy, where more specific classes are subsumed by more general ones (e.g., person). We discuss two families of methods to harvest such information: Wikipedia-based approaches that analyze the category system, and Web-based approaches that use techniques like set expansion.
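To make the data model concrete, the following is a minimal Python sketch: it stores a few SPO triples (the entities and classes are illustrative, not drawn from an actual KB), resolves an entity's classes through the taxonomy, and uses a single Hearst-style "such as" pattern to propose new type triples from text, in the spirit of Web-based class harvesting. Real systems use many more patterns plus statistical filtering.

```python
import re

# A knowledge base as a set of RDF-style SPO (subject-predicate-object)
# triples. Entities, classes, and predicates here are illustrative only.
kb = {
    ("Steve_Jobs", "type", "computer_pioneer"),
    ("Steve_Jobs", "type", "entrepreneur"),
    ("computer_pioneer", "subclassOf", "person"),
    ("entrepreneur", "subclassOf", "person"),
}

def classes_of(entity, triples):
    """Return all classes of an entity, following subclassOf edges up the
    taxonomy (more specific classes are subsumed by more general ones)."""
    direct = {o for s, p, o in triples if s == entity and p == "type"}
    closed = set(direct)
    frontier = set(direct)
    while frontier:
        frontier = {o for s, p, o in triples
                    if s in frontier and p == "subclassOf"} - closed
        closed |= frontier
    return closed

print(classes_of("Steve_Jobs", kb))
# -> {'computer_pioneer', 'entrepreneur', 'person'} (set order may vary)

# One Hearst-style pattern ("C such as E1, E2, ...") that proposes new
# type triples from free text; a stand-in for Web-based class harvesting.
HEARST = re.compile(r"(\w+(?: \w+)?) such as ((?:\w+(?:, )?)+)")

def propose_type_triples(sentence):
    m = HEARST.search(sentence)
    if not m:
        return []
    cls = m.group(1).lower().replace(" ", "_")
    entities = [e.strip() for e in m.group(2).split(",")]
    return [(e, "type", cls) for e in entities if e]

print(propose_type_triples("computer pioneers such as Jobs, Wozniak"))
# -> [('Jobs', 'type', 'computer_pioneers'),
#     ('Wozniak', 'type', 'computer_pioneers')]
```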
HARVESTING FACTS AT WEB SCALE

Harvesting Relational Facts: Relational facts express properties of and relationships between entities. There is a large spectrum of methods to extract such facts from Web documents. We give an overview of methods from pattern matching (e.g., regular expressions), computational linguistics (e.g., dependency parsing), statistical learning (e.g., factor graphs and MLNs), and logical consistency reasoning (e.g., weighted MaxSAT or ILP solvers). We also discuss to what extent these approaches scale to handle big data.

Open Information Extraction: As an alternative to methods that operate on a pre-specified set of relations and entities, open information extraction harvests arbitrary SPO triples from natural language documents. It aggressively taps into noun phrases as entity candidates and verbal phrases as prototypic patterns for relations. We discuss recent methods that follow this direction. Some methods along these lines make clever use of big-data techniques like frequent sequence mining and map-reduce computation.

Temporal and Multilingual Knowledge: Properly interpreting entities and facts in a KB often requires additional meta-information, such as entity names in different languages and the temporal scope of facts. We discuss techniques for harvesting such meta-information.
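The following Python sketch contrasts the two extraction styles on a toy corpus (all sentences and patterns are assumptions for illustration): hand-crafted regular expressions harvest facts for two pre-specified relations, while an open-IE-style pattern treats capitalized noun phrases as entity candidates and counts the intervening verbal phrases as prototypic relation patterns. Real systems replace the simple counting with frequent sequence mining over massive corpora, typically in a map-reduce setting.

```python
import re
from collections import Counter

# Illustrative toy corpus; the sentences are assumptions for this sketch.
corpus = [
    "Steve Jobs was born in San Francisco.",
    "Alan Turing was born in London.",
    "Steve Jobs founded Apple.",
    "Larry Page founded Google.",
]

NP = r"[A-Z]\w+(?: [A-Z]\w+)*"  # crude capitalized noun-phrase candidate

# Pattern-based harvesting: one hand-crafted regular expression per
# pre-specified relation.
RELATION_PATTERNS = {
    "bornIn":  re.compile(rf"({NP}) was born in ({NP})"),
    "founded": re.compile(rf"({NP}) founded ({NP})"),
}

facts = [(m.group(1), rel, m.group(2))
         for rel, pat in RELATION_PATTERNS.items()
         for sentence in corpus
         for m in pat.finditer(sentence)]
print(facts)
# -> e.g. ('Steve Jobs', 'bornIn', 'San Francisco'),
#         ('Larry Page', 'founded', 'Google'), ...

# Open-IE flavor: noun phrases as entity candidates, the verbal phrase in
# between as a prototypic relation pattern; frequency counting (here a
# Counter, in real systems frequent sequence mining via map-reduce)
# separates promising patterns from noise.
OPEN = re.compile(rf"({NP}) ([a-z][\w ]*?) ({NP})")
pattern_counts = Counter(m.group(2)
                         for sentence in corpus
                         for m in OPEN.finditer(sentence))
print(pattern_counts.most_common())
# -> [('was born in', 2), ('founded', 2)]
```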