Real-world datasets often suffer from various data quality problems. Several data cleaning solutions have been proposed, yet data cleaning remains a manual and iterative task that requires domain and technical expertise. Exploiting metadata promises to ease the tedious process of data preparation, because data errors are detectable through metadata. This article investigates the intrinsic connection between metadata and data errors. We establish a mapping that reflects the connection between data quality issues and extractable metadata using qualitative and quantitative techniques. Additionally, we present a taxonomy based on a closed grammar that covers all existing metadata and allows the composition of novel types of metadata. We provide a case study to show the practical application of the grammar for generating new metadata for data quality assessment.
Introduction
Welcome to the First AHA!-Workshop on Information Discovery in Text! In this workshop, we are bringing together leading researchers in the emerging field of Information Discovery to discuss approaches for Information Extraction that are not bound by a pre-specified schema of information, but rather discover relational or categorial structure automatically from given unstructured data.
This includes approaches that are based on unsupervised machine learning over models of distributional semantics, as well as OpenIE methods that relax the definition of semantic relations in order to more openly extract structured information. Other approaches focus on inexpensively training information extractors to be used across different domains with minimal supervision, or on adapting existing IE systems to new domains and relations. We received 19 paper submissions, of which the programme committee has accepted ten: six were chosen for oral presentation and four as posters.
We look forward to a workshop full of interesting paper presentations, invited talks and lively discussion.

Abstract
Recent approaches to relation extraction following the distant supervision paradigm have focused on exploiting large knowledge bases, from which they extract a substantial amount of supervision. However, for many relations in real-world applications, there are few instances available to seed the relation extraction process, and the named entity recognizers needed for pre-processing do not exist. To overcome this issue, we learn entity filters jointly with relation extraction using imitation learning. We evaluate our approach on architect names and building completion years, using only around 30 seed instances for each relation, and show that the jointly learned entity filters improved the performance by 30 and 7 points in average precision.