Coupling semi-supervised learning of categories and relations

Carlson, Andrew; Betteridge, Justin; Hruschka, Estevam R.; Mitchell, Tom M.

doi:10.3115/1621829.1621830

Cited by 45 publications

(37 citation statements)

References 16 publications

“…Information about relation similarity is used in training and evaluation, as it roughly indicates how confusable the linguistic expression of two relations are. This would indicate, for example, that relation colearning (Carlson et al 2009) would not work for similar relations. Ambiguity is defined for each relation as the max relation similarity for the relation.…”

Section: Crowd Truth (mentioning)

confidence: 99%
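The ambiguity measure quoted above (the maximum relation similarity for each relation) can be illustrated with a minimal sketch. The similarity scores, the relation names, and the assumption that similarity is symmetric are hypothetical; the citing paper's own similarity computation is not reproduced here.

```python
def relation_ambiguity(similarity):
    """Map each relation to its highest similarity with any other relation.

    `similarity` is keyed by unordered relation pairs (frozensets of two names),
    which assumes the similarity measure is symmetric.
    """
    relations = {r for pair in similarity for r in pair}
    ambiguity = {}
    for r in relations:
        # Collect the scores of every pair that involves relation r.
        scores = [s for pair, s in similarity.items() if r in pair and len(pair) == 2]
        ambiguity[r] = max(scores, default=0.0)
    return ambiguity


# Hypothetical similarity scores, for illustration only.
similarity = {
    frozenset({"treats", "prevents"}): 0.6,
    frozenset({"treats", "causes"}): 0.1,
    frozenset({"prevents", "causes"}): 0.2,
}
ambiguity = relation_ambiguity(similarity)
```

Under these made-up scores, "treats" and "prevents" are the most confusable pair, so both receive the highest ambiguity.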

Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation

2015


Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, crowd truth, that is based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.


“…While these earlier methods showed the feasibility of semi-supervised learning of extraction patterns, they were limited because accurate learning requires more constraints than are provided by a few dozen labeled training examples. Our algorithm achieves significantly higher accuracy by using the input ontology itself to provide additional constraints that guide the learner [9]. For example, when our algorithm learns extraction patterns for the predicates 'person', 'team' and 'plays-on-team', prior knowledge from the ontology requires that for any unlabeled sentence containing noun phrases A and B, the extractor for 'plays-on-team' can label <A, B > a positive example of the relation only if the 'person' classifier labels A positive, and the 'team' classifier labels B positive.…”

Section: The Problem (mentioning)

confidence: 99%
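The type-checking constraint described in the statement above can be sketched as a simple filter: a candidate pair is accepted for 'plays-on-team' only if its first argument is labeled a person and its second a team. This is a minimal illustration, not the authors' implementation; the function name, the set-based classifiers, and the toy data are all assumptions.

```python
def type_checked_pairs(candidates, is_person, is_team):
    """Keep only (A, B) pairs where A is labeled a person and B a team."""
    return [(a, b) for a, b in candidates if is_person(a) and is_team(b)]


# Hypothetical category classifiers, modeled here as set membership.
people = {"Alice Archer", "Bob Baker"}
teams = {"Red Hawks", "Blue Owls"}

# Hypothetical candidate pairs proposed by a 'plays-on-team' extractor.
candidates = [
    ("Alice Archer", "Red Hawks"),  # person, team: accepted
    ("Red Hawks", "Blue Owls"),     # first argument is not a person: rejected
    ("Bob Baker", "Springfield"),   # second argument is not a team: rejected
]

accepted = type_checked_pairs(candidates, people.__contains__, teams.__contains__)
```

In a real learner the category classifiers would themselves be learned from unlabeled text, so the filter couples the relation extractor to the category extractors rather than to fixed sets.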

“…The textual pattern learner, CBL [9], iteratively grows a set of extraction patterns while obeying mutual exclusion, subset, and type checking constraints given by the ontology. The HTML pattern learner, SEAL [10], learns patterns of HTML and text tokens that capture regularities such as HTML lists of predicate instances.…”

Section: The Readtheweb System (mentioning)

confidence: 99%
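The mutual exclusion and subset constraints mentioned above can be sketched in the same spirit. The sketch below assumes one plausible reading: an instance promoted to a category is also treated as an instance of that category's supersets, and a candidate is rejected if it already belongs to a category that is mutually exclusive with any of them. The data structures and names are hypothetical, and the superset hierarchy is assumed to be acyclic.

```python
def promote(candidates, known, exclusive, parent):
    """Add candidate instances to categories while respecting ontology constraints.

    candidates: list of (instance, category) pairs proposed by a learner
    known:      dict mapping category -> set of accepted instances (updated in place)
    exclusive:  dict mapping category -> categories it is mutually exclusive with
    parent:     dict mapping category -> its direct superset category
    """
    for instance, category in candidates:
        # Subset constraint: collect the category and all of its supersets.
        chain = [category]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        # Mutual exclusion: reject if the instance is already in a conflicting category.
        blocked = any(
            instance in known.get(other, set())
            for c in chain
            for other in exclusive.get(c, ())
        )
        if blocked:
            continue
        for c in chain:
            known.setdefault(c, set()).add(instance)
    return known


# Hypothetical toy ontology and seed knowledge.
known = {"city": {"Boston"}}
exclusive = {"person": {"city"}, "city": {"person"}}
parent = {"athlete": "person"}

known = promote([("Boston", "athlete"), ("Ana", "athlete")], known, exclusive, parent)
```

Here "Boston" is rejected as an athlete because athlete is a subset of person and person excludes city, while "Ana" is accepted as both an athlete and a person.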

Populating the Semantic Web by Macro-reading Internet Text

Mitchell, Betteridge, Carlson et al.

2009

Lecture Notes in Computer Science

Self Cite


Abstract. A key question regarding the future of the semantic web is "how will we acquire structured information to populate the semantic web on a vast scale?" One approach is to enter this information manually. A second approach is to take advantage of pre-existing databases, and to develop common ontologies, publishing standards, and reward systems to make this data widely accessible. We consider here a third approach: developing software that automatically extracts structured information from unstructured text present on the web. We also describe preliminary results demonstrating that machine learning algorithms can learn to extract tens of thousands of facts to populate a diverse ontology, with imperfect but reasonably good accuracy.

The Problem

The future impact of the semantic web will depend critically on the breadth and depth of its content. One can imagine several approaches to constructing this content, including manual content entry by motivated teams of people, convincing owners of existing databases to publish them on the semantic web, and automatically extracting structured information from the vast quantity of unstructured online text. We consider here the third of these approaches, and argue both that it is feasible and that this kind of approach will be able to collect knowledge that is unlikely to be captured as easily by other approaches.

The feasibility of extracting structured information automatically from text will itself depend on the technical state-of-the-art of natural language processing (NLP) methods. We have witnessed significant progress in NLP over the past decade, on problems from sentence parsing [1] to named entity extraction [2], to question answering [3], to document classification [4]. Nevertheless, computer algorithms remain very far from being able to truly "understand" natural language text (e.g., to read and extract the full content of the paper you are currently reading). Given this shortcoming, why might we take the position that NLP algorithms offer a promising near-term approach to populating the semantic web?

We believe automatic methods offer a feasible near-term approach because the problem of automatically populating large databases from the internet can be formulated so that it is much easier to solve than the problem of full natural language understanding. Our own formulation involves three key design choices:


“…[Fig. 9: The pseudo-code of the proposed algorithm to wrapper induction] It can be noticed that the first concept k_1 aggregates in itself the information about the 6 , o_7 surrounded from the left by such prefixes as <li class = "film title"><br/> and <ul><li class = "film title"><br/> etc. The objects o_i ∈ k_1 are surrounded by HTML tokens expansions of lengths conceptLength(k_1) = {2, 3}.…”

Section: Fig (mentioning)

confidence: 99%

“…[21]. IESs, such as Never-Ending Language Learner (NELL), Know It All, TextRunner, or Snowball represent this approach [1,3,6,9,10,22,23,56,59,68,78]. The systems mentioned above represent the trend called open IE.…”

Section: State Of the Art And Related Work (mentioning)

confidence: 99%

The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction

Mirończuk

2017

Knowl Inf Syst


The aim of this study is to propose an information extraction system, called BigGrams, which is able to retrieve relevant and structural information (relevant phrases, keywords) from semi-structural web pages, i.e. HTML documents. For this purpose, a novel semi-supervised wrappers induction algorithm has been developed and embedded in the BigGrams system. The wrappers induction algorithm utilizes a formal concept analysis to induce information extraction patterns. Also, in this article, the author (1) presents the impact of the configuration of the information extraction system components on information extraction results and (2) tests the boosting mode of this system. Based on empirical research, the author established that the proposed taxonomy of seeds and the HTML tags level analysis, with appropriate pre-processing, improve information extraction results. Also, the boosting mode works well when certain requirements are met, i.e. when well-diversified input data are ensured.

