Selecting actions for resource-bounded information extraction using reinforcement learning

Kanani, Pallika; McCallum, Andrew

doi:10.1145/2124295.2124328

Cited by 19 publications

(22 citation statements)

References 10 publications

“…Such a setup would be more adaptive with respect to the number of queries asked and could thus be potentially more effective at avoiding to ask too many queries (cf. [9]). …”

Section: Discussion (mentioning)

confidence: 99%

“…While this is a relatively new approach, there are some related works. The most similar is perhaps Kanani and McCallum's [9] work on using reinforcement learning to learn an optimal policy for efficiently filling in missing values in a KB (they focus on filling in the email address, job title, and department affiliation of 100 professors at UMass Amherst). The actions available are to perform one of 20 possible types of query (e.g., name, name + "CV", name + "Amherst"), to download one of the n resulting Web pages, or to extract one of the three relations from the page.…”

Section: Related Work (mentioning)

confidence: 99%
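
The action space described in the statement above (issue a query, download a result page, extract a relation) can be sketched as below. This is only an illustration under stated assumptions, not Kanani and McCallum's implementation: the names `Action`, `enumerate_actions`, `QUERY_TEMPLATES`, and `RELATIONS` are hypothetical, and only the three query types quoted in the statement are listed, not all 20.

```python
from dataclasses import dataclass
from typing import List

# Only the three query templates quoted in the citation statement are shown;
# the statement says there are 20 query types, the others are not listed.
QUERY_TEMPLATES = ['{name}', '{name} "CV"', '{name} "Amherst"']

# The three relations named in the statement (identifiers are hypothetical).
RELATIONS = ["email", "job_title", "department"]


@dataclass(frozen=True)
class Action:
    kind: str    # "query", "download", or "extract"
    target: str  # query string, result-page index, or relation name


def enumerate_actions(name: str, n_pages: int) -> List[Action]:
    """List every action available for one entity (hypothetical helper)."""
    actions = [Action("query", t.format(name=name)) for t in QUERY_TEMPLATES]
    actions += [Action("download", str(i)) for i in range(n_pages)]
    actions += [Action("extract", r) for r in RELATIONS]
    return actions
```

A learned policy would then score these actions in each state and pick one per step until the resource budget runs out; the scoring itself is outside this sketch.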

Knowledge base completion via search-based question answering

West, Gabrilovich, Murphy et al. 2014

Proceedings of the 23rd International Conference on World Wide Web

Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search-based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. For example, if we want to find Frank Zappa's mother, we could ask the query who is the mother of Frank Zappa. However, this is likely to return 'The Mothers of Invention', which was the name of his band. Our system learns that it should (in this case) add disambiguating terms, such as Zappa's place of birth, in order to make it more likely that the search results contain snippets mentioning his mother. Our system also learns how many different queries to ask for each attribute, since in some cases, asking too many can hurt accuracy (by introducing false positives). We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.

“…Similar systems optimize the use of information extraction programs to add missing data values to an existing database [Kanani and McCallum 2012]. These techniques generally improve execution time or storage capacity by processing only those "promising" documents in the collection that contain information about the database relations, instead of the whole collection.…”

Section: Related Work (mentioning)

confidence: 99%

“…Researchers have noticed the overheads and costs of curating and organizing large datasets [Dong et al 2013; Kanani and McCallum 2012; Jain et al 2008a]. For example, some researchers have recently considered the problem of selecting datasets for fusion such that the marginal cost of acquiring and processing a new dataset does not exceed its marginal gain, where cost and gain are measured using the same metric, such as U.S. dollars [Dong et al 2013].…”

Section: Costs Of Concept Extractions (mentioning)

confidence: 99%

Cost-Effective Conceptual Design for Information Extraction

Termehchy, Vakilian, Chodpathumwan et al. 2015

ACM Trans. Database Syst.

It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, it is very resource intensive to create and maintain large annotated collections. Since the available resources of an enterprise are limited and/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this article, we introduce and formally define the problem of cost-effective conceptual design where, given a collection, a set of relevant concepts, and a fixed budget, one would like to find a conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove it is generally NP-hard in the number of relevant concepts. We propose three efficient approximations to solve the problem: a greedy algorithm, an approximate popularity maximization (APM for short), and approximate annotation-benefit maximization (AAM for short). We show that, if there are no constraints regarding the overlap of concepts, APM is a fully polynomial time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio if the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results using a Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.
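
The budgeted selection problem in this abstract can be illustrated with a generic benefit-per-cost greedy heuristic. This is a sketch under stated assumptions, not the greedy algorithm, APM, or AAM from the article: the function name `greedy_design`, the `(benefit, cost)` representation, and the example numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def greedy_design(concepts: Dict[str, Tuple[float, float]],
                  budget: float) -> List[str]:
    """Pick concepts by benefit/cost ratio until the budget is exhausted.

    `concepts` maps a concept name to (benefit, cost); costs must be positive.
    This is a standard knapsack-style heuristic used only for illustration.
    """
    # Rank concepts from highest to lowest benefit per unit of cost.
    ranked = sorted(concepts.items(),
                    key=lambda kv: kv[1][0] / kv[1][1],
                    reverse=True)
    chosen: List[str] = []
    spent = 0.0
    for name, (_benefit, cost) in ranked:
        # Take the concept only if it still fits in the remaining budget.
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen


# Hypothetical example: three concepts and a budget of 7 cost units.
example = {"person": (10.0, 5.0), "city": (6.0, 2.0), "org": (4.0, 4.0)}
design = greedy_design(example, budget=7.0)
```

With these made-up numbers, "city" and "person" fit within the budget and "org" does not.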

“…Researchers have proposed several techniques to reduce the execution time of SQL queries over existing databases whose information comes from concept and relation extraction programs [13,15]. Similar systems optimize the use of information extraction programs to add missing data values to an existing database [16]. These techniques generally improve execution time or storage capacity by processing only the "promising" documents in the collection that contain the information about the database relations, instead of the whole collection.…”

Section: Related Work (mentioning)

confidence: 99%

Which concepts are worth extracting?

Termehchy, Vakilian, Chodpathumwan et al. 2014

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, it is very resource intensive to create and maintain large annotated collections. Since the available resources of an enterprise are limited and/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this paper, we introduce the problem of cost-effective conceptual design, where given a collection, a set of relevant concepts, and a fixed budget, one would like to find a conceptual design that improves the effectiveness of answering queries over the collection the most. We prove that the problem is generally NP-hard in the number of relevant concepts and propose two efficient approximation algorithms to solve the problem: Approximate Popularity Maximization (APM for short) and Approximate Annotation-benefit Maximization (AAM for short). We show that if there are no constraints regarding the overlap of concepts, APM is a fully polynomial time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, APM has a constant approximation ratio and AAM is a fully polynomial time approximation scheme. Our empirical results using a Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that in general APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.
