New scientific concepts, interpreted broadly, are continuously introduced in the literature, but relatively few concepts have a long-term impact on society. The identification of such concepts is a challenging prediction task that would help multiple parties, including researchers and the general public, focus their attention within the vast scientific literature. In this paper we present a system that predicts the future impact of a scientific concept, represented as a technical term, based on the information available from recently published research articles. We analyze the usefulness of rich features derived from the full text of the articles through a variety of approaches, including rhetorical sentence analysis, information extraction, and time-series analysis. The results from two large-scale experiments with 3.8 million full-text articles and 48 million metadata records support the conclusion that full-text features are significantly more useful for prediction than metadata-only features, and that the most accurate predictions result from combining the metadata and full-text features. Surprisingly, these results hold even when the metadata features are available for a much larger number of documents than the full-text features are.
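The abstract describes the feature families only at a high level. As a minimal sketch of the feature-combination idea, assuming entirely hypothetical metadata features (paper counts, venue counts, citation-trend slope) and full-text features (rhetorical sentence shares, extracted relation counts), not the authors' actual pipeline:

```python
# Illustrative sketch only, not the authors' system: combine hypothetical
# metadata-only and full-text feature values for each technical term and
# fit a single classifier to predict later impact.
import numpy as np
from sklearn.linear_model import LogisticRegression

def combined_features(term):
    metadata = [
        term["n_papers"],           # papers mentioning the term
        term["n_venues"],           # distinct publication venues
        term["citation_growth"],    # slope of the citation time series
    ]
    full_text = [
        term["frac_method_sents"],  # share of mentions in "method" sentences
        term["frac_claim_sents"],   # share of mentions in "claim" sentences
        term["n_extracted_rels"],   # relations extracted around the term
    ]
    return np.array(metadata + full_text, dtype=float)

# Toy training data: one dict per term; labels mark later high impact.
terms = [
    {"n_papers": 12, "n_venues": 4, "citation_growth": 0.8,
     "frac_method_sents": 0.5, "frac_claim_sents": 0.2,
     "n_extracted_rels": 7, "high_impact": 1},
    {"n_papers": 3, "n_venues": 1, "citation_growth": 0.1,
     "frac_method_sents": 0.1, "frac_claim_sents": 0.05,
     "n_extracted_rels": 1, "high_impact": 0},
]
X = np.stack([combined_features(t) for t in terms])
y = np.array([t["high_impact"] for t in terms])
model = LogisticRegression(max_iter=1000).fit(X, y)
```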
How many guns are there in the United States? What is the incidence of breast cancer? Is a billion dollar budget cut large or small? Advocates of scientific and civic literacy are concerned with improving how people estimate and comprehend risks, measurements, and frequencies, but relatively little progress has been made in this direction. In this article we describe and test a framework to help people comprehend numerical measurements in everyday settings through simple sentences, termed perspectives, that employ ratios, ranks, and unit changes to make them easier to understand. We use a crowdsourced system to generate perspectives for a wide range of numbers taken from online news articles. We then test the effectiveness of these perspectives in three randomized, online experiments involving over 3,200 participants. We find that perspective clauses substantially improve people's ability to recall measurements they have read, estimate ones they have not, and detect errors in manipulated measurements. We see this as the first of many steps in leveraging digital platforms to improve numeracy among online readers.
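The paper's perspectives are generated by crowdworkers, but the basic ratio form can be illustrated mechanically. In the sketch below, the reference quantities, the figure of 393 million guns, and the output phrasing are assumptions for illustration only:

```python
# Minimal sketch, not the paper's crowdsourced system: re-express a raw
# measurement as a ratio against the closest familiar reference quantity.
FAMILIAR_REFERENCES = {            # hypothetical reference values
    "population of the United States": 330_000_000,
    "population of New York City": 8_500_000,
}

def perspective(value, unit):
    """Return a simple ratio-style perspective sentence for `value`."""
    name, ref = min(FAMILIAR_REFERENCES.items(),
                    key=lambda kv: abs(value / kv[1] - 1))
    ratio = value / ref
    if ratio >= 1:
        return f"{value:,} {unit} is about {ratio:.1f} times the {name}."
    return f"{value:,} {unit} is about {ratio:.0%} of the {name}."

print(perspective(393_000_000, "guns"))
# -> 393,000,000 guns is about 1.2 times the population of the United States.
```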
We introduce the REEL (RElation Extraction Learning) framework, an open source framework that facilitates the development and evaluation of relation extraction systems over text collections. To define a relation extraction system for a new relation and text collection, users only need to specify the parsers to load the collection, the relation and its constraints, and the learning and extraction techniques to be used. This makes REEL a powerful framework to enable the deployment and evaluation of relation extraction systems for both application building and research.
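REEL's actual interfaces are defined in its own documentation; the sketch below only mirrors the three inputs the abstract lists (a collection parser, a relation definition with constraints, and learning and extraction techniques), with every class and function name hypothetical:

```python
# Hypothetical sketch of the configuration surface described above; these
# names are illustrative, not REEL's actual API.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class RelationSpec:
    name: str
    arg_types: Tuple[str, str]                     # entity type of each slot
    constraints: List[Callable[[str, str], bool]]  # e.g., arg1 != arg2

def run_extraction(parse_collection, relation, learn, extract):
    """Wire together the user-supplied pieces: parser, relation, techniques."""
    documents = parse_collection()          # 1. load the text collection
    model = learn(documents, relation)      # 2. train an extraction model
    candidates = extract(model, documents)  # 3. extract candidate tuples
    return [t for t in candidates           # 4. enforce relation constraints
            if all(check(*t) for check in relation.constraints)]
```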
Knowledge bases (KBs) are the backbone of many ubiquitous applications and are thus required to exhibit high precision. However, for KBs that store subjective attributes of entities, e.g., whether a movie is kid friendly, simply estimating precision is complicated by the inherent ambiguity in measuring subjective phenomena. In this work, we develop a method for constructing KBs with tunable precision, i.e., KBs that can be made to operate at a specific false positive rate, despite storing both difficult-to-evaluate subjective attributes and more traditional factual attributes. The key to our approach is probabilistically modeling user consensus with respect to each entity-attribute pair, rather than modeling each pair as either True or False. Uncertainty in the model is explicitly represented and used to control the KB's precision. We propose three neural networks for fitting the consensus model and evaluate each one on data from Google Maps, a large KB of locations and their subjective and factual attributes. The results demonstrate that our learned models are well calibrated and thus can successfully be used to control the KB's precision. Moreover, when constrained to maintain 95% precision, the best consensus model matches the F-score of a baseline that models each entity-attribute pair as a binary variable and does not support tunable precision. When unconstrained, our model dominates the same baseline by 12% F-score. Finally, we perform an empirical analysis of attribute-attribute correlations and show that leveraging them effectively contributes to reduced uncertainty and better performance in attribute prediction.
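The abstract does not spell out how calibrated probabilities translate into a target operating point. One standard recipe, sketched below on synthetic data (the scores, labels, and threshold heuristic are assumptions, not the paper's method), is to choose a score threshold on held-out labels so that admitted pairs stay under the desired false positive rate:

```python
# Minimal sketch, not the paper's system: threshold calibrated consensus
# probabilities so the KB admits entity-attribute pairs at a target false
# positive rate, estimated from held-out labeled data.
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr=0.05):
    """scores: calibrated P(attribute holds); labels: 0/1 ground truth."""
    negatives = np.sort(scores[labels == 0])
    # Smallest threshold letting through at most target_fpr of negatives.
    k = int(np.ceil((1 - target_fpr) * len(negatives)))
    return negatives[min(k, len(negatives) - 1)]

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)                    # synthetic ground truth
scores = np.clip(0.6 * labels + rng.normal(0.3, 0.2, 1000), 0, 1)
tau = threshold_for_fpr(scores, labels, target_fpr=0.05)
admitted = scores > tau          # pairs written to the KB at ~5% FPR
```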
Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than is possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text collections, whose contents are not crawlable and are only available via querying. Important steps for efficient information extraction over deep-web text collections (e.g., selecting the collections on which to focus the extraction effort, based on their contents; or learning which documents within these collections, and in which order, to process, based on their words and phrases) require having a representative document sample from each collection. These document samples have to be collected by querying the deep-web text collections, an expensive process that renders impractical the existing sampling approaches developed for other data scenarios. In this paper, we systematically study the space of query-based document sampling techniques for information extraction over the deep web. Specifically, we consider (i) alternative query execution schedules, which vary in how they account for query effectiveness, and (ii) alternative document retrieval and processing schedules, which vary in how they distribute the extraction effort over documents. We report the results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web. Our results show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
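The paper evaluates several schedules; purely to make the notion of a "query execution schedule" concrete, here is a greedy keyword-based sampler. The `search` and `fetch` callables and the effectiveness heuristic are assumptions for illustration, not the paper's algorithms:

```python
# Illustrative greedy query schedule for sampling a query-only (deep-web)
# collection: issue the candidate keyword estimated to be most effective,
# count the unseen documents it returns, and mine those documents for new
# candidate keywords.
import heapq, itertools, re

def sample_collection(search, fetch, seed_queries, budget):
    """search(q) -> doc ids; fetch(d) -> text; budget = max queries issued."""
    seen_docs, seen_queries, sample = set(), set(seed_queries), []
    order = itertools.count()                   # heap tie-breaker
    heap = [(-1.0, next(order), q) for q in seed_queries]
    heapq.heapify(heap)
    while heap and budget > 0:
        _, _, q = heapq.heappop(heap)           # most promising query first
        budget -= 1
        new_docs = [d for d in search(q) if d not in seen_docs]
        seen_docs.update(new_docs)
        for d in new_docs:
            text = fetch(d)                     # retrieve and process
            sample.append(text)
            for w in set(re.findall(r"[a-z]{4,}", text.lower())):
                if w not in seen_queries:
                    seen_queries.add(w)
                    # Child keywords inherit the parent query's yield.
                    heapq.heappush(heap, (-len(new_docs), next(order), w))
    return sample
```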