Information extraction traditionally focuses on extracting relations between identifiable entities, such as ⟨Monterey, locatedIn, California⟩. Yet, texts often also contain counting information, stating that a subject is in a specific relation with a number of objects, without mentioning the objects themselves, for example, "California is divided into 58 counties". Such counting quantifiers can help in a variety of tasks such as query answering or knowledge base curation, but are neglected by prior work. This paper develops the first full-fledged system for extracting counting information from text, called CINEX. We employ distant supervision using fact counts from a knowledge base as training seeds, and develop novel techniques for dealing with several challenges: (i) non-maximal training seeds due to the incompleteness of knowledge bases, (ii) sparse and skewed observations in text sources, and (iii) high diversity of linguistic patterns. Experiments with five human-evaluated relations show that CINEX can achieve 60% average precision for extracting counting information. In a large-scale experiment, we demonstrate the potential for knowledge base enrichment by applying CINEX to 2,474 frequent relations in Wikidata. CINEX can assert the existence of 2.5M facts for 110 distinct relations, which is 28% more than the existing Wikidata facts for these relations.
arXiv:1807.03656v1 [cs.CL] 10 Jul 2018
Second, an important use case is KB curation [8,34]. KBs are notoriously incomplete, contain erroneous triples, and are limited in keeping up with the pace of real-world changes. Counting information helps to identify gaps and inaccuracies. For example, knowing the exact number of counties in California or a lower bound for the number of films directed by Eastwood provides important cues for completing and enriching a KB.
State-of-the-Art and Challenges. The predominant approach to extracting facts for KB population is distant supervision, using seeds for the SPO triples of interest (e.g., [21,32]). The seeds are usually taken from an initial KB or are manually compiled. Spotting the seeds in a text corpus (e.g., Clint Eastwood, directed, and Gran Torino) then allows learning patterns for relations (e.g., "director of" or "someone's masterpiece"), which in turn lead to observing new fact candidates. This methodology is known as the pattern-relation duality principle [2].
Distant supervision is a natural approach for extracting counting information as well: the cardinality of distinct O arguments for a given SP pair, n := |{O | ⟨S, P, O⟩ ∈ KB}|, serves as a seed for the counting assertion ⟨S, P, ∃n⟩. However, it is more challenging than traditional SPO-fact extraction and needs to cope with several issues:
1) Non-maximal seeds: Unlike for SPO-fact extraction, the incompleteness of KBs not only leads to a reduction in the number of seeds, but also to seeds that systematically underestimate the count of facts that are valid in reality. For example, a KB that knows only a subset of Trump's children, say three out of five, leads to a non-maximal s...