The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 facts, but it also suggested a challenge: how can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies subclasses in order to boost recall. List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate the need for hand-labeled training examples. The paper reports on experiments, focused on named-entity extraction, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall while maintaining high precision, and discovered over 10,000 cities missing from the Tipster Gazetteer.
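Both the domain-independent rules KNOWITALL starts from and the domain-specific rules that Pattern Learning adds are essentially lexical extraction patterns applied to Web text. The sketch below is a minimal illustration of that idea, not KNOWITALL's implementation: the pattern strings, the hypothetical "headquartered in" rule, and the example sentence are assumptions made for illustration only.

```python
import re

# One proper noun phrase: "Boston", "Baton Rouge", ...
NP = r"[A-Z][A-Za-z-]*(?:\s+[A-Z][A-Za-z-]*)*"

def generic_pattern(plural_label: str) -> str:
    # Domain-independent rule of the form "<plural class label> such as <NP list>".
    return rf"{plural_label}\s+such\s+as\s+({NP}(?:\s*,\s*{NP})*(?:,?\s*and\s+{NP})?)"

# A hypothetical learned, domain-specific rule for the class City.
LEARNED_CITY_PATTERN = rf"headquartered\s+in\s+({NP})"

def extract_cities(text: str) -> set:
    """Apply both rule types to raw text and return candidate City instances."""
    candidates = set()
    for pattern in (generic_pattern("cities"), LEARNED_CITY_PATTERN):
        for match in re.finditer(pattern, text):
            # Split "Boston, Portland and Vancouver" into individual names.
            candidates.update(re.split(r"\s*,\s*|\s+and\s+", match.group(1)))
    return candidates

if __name__ == "__main__":
    snippet = ("The firm, headquartered in Seattle, now serves cities "
               "such as Boston, Portland and Vancouver.")
    print(extract_cities(snippet))  # {'Seattle', 'Boston', 'Portland', 'Vancouver'}
```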
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KNOWITALL, a system that aims to automate the tedious process of extracting large collections of facts from the Web in an autonomous, domain-independent, and scalable manner. The paper describes preliminary experiments in which an instance of KNOWITALL, running for four days on a single machine, was able to automatically extract 54,753 facts. KNOWITALL associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes KNOWITALL's architecture and reports on lessons learned for the design of large-scale information extraction systems.
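The per-fact probability is what lets the system trade precision against recall. The snippet below is a hedged sketch of one way such a score can be computed in the spirit of PMI-based assessment over search-engine statistics; it assumes hit counts are already available as plain integers, and the thresholds and conditional probabilities are illustrative placeholders rather than values from the paper.

```python
from math import prod

def pmi(hits_discriminator_plus_instance: int, hits_instance: int) -> float:
    """Co-occurrence score for an extraction: how often the instance appears
    together with a class discriminator phrase (e.g., "city of <X>"),
    relative to how often the instance appears at all."""
    return hits_discriminator_plus_instance / hits_instance if hits_instance else 0.0

def fact_probability(pmi_scores, thresholds, prior=0.5, p_true=0.9, p_false=0.2):
    """Naive-Bayes-style combination of thresholded PMI features into a single
    probability that the extracted fact is correct (placeholder parameters)."""
    feats = [s >= t for s, t in zip(pmi_scores, thresholds)]
    pos = prior * prod(p_true if f else 1 - p_true for f in feats)
    neg = (1 - prior) * prod(p_false if f else 1 - p_false for f in feats)
    return pos / (pos + neg)

if __name__ == "__main__":
    # Hypothetical hit counts for a candidate fact City("Tukwila"),
    # checked against two discriminator phrases.
    scores = [pmi(1200, 90000), pmi(45, 90000)]
    print(fact_probability(scores, thresholds=[0.01, 0.001]))
```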
Because of their high surface areas, crystallinity, and tunable properties, metal–organic frameworks (MOFs) have attracted intense interest as next-generation materials for gas capture and storage. While much effort has been devoted to the discovery of new MOFs, a vast catalog of existing MOFs resides within the Cambridge Structural Database (CSD), many of whose gas uptake properties have not been assessed. Here we employ data mining and automated structure analysis to identify, “clean up,” and rapidly predict the hydrogen storage properties of these compounds. Approximately 20 000 candidate compounds were generated from the CSD using an algorithm that removes solvent/guest molecules. These compounds were then characterized with respect to their surface area and porosity. Employing the empirical relationship between excess H2 uptake and surface area, we predict the theoretical total hydrogen storage capacity for the subset of ∼4000 compounds exhibiting nontrivial internal porosity. Our screening identifies several overlooked compounds having high theoretical capacities; these compounds are suggested as targets of opportunity for additional experimental characterization. More importantly, screening reveals that the relationship between gravimetric and volumetric H2 density is concave downward, with maximal volumetric performance occurring for surface areas of 3100–4800 m²/g. We conclude that H2 storage in MOFs will not benefit from further improvements in surface area alone. Rather, discovery efforts should aim to achieve moderate mass densities and surface areas simultaneously, while ensuring framework stability upon solvent removal.
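A hedged sketch of the screening arithmetic follows, under assumptions that are mine rather than the paper's exact procedure: a Chahine-rule-like coefficient of roughly 2 wt% excess H2 per 1000 m²/g at 77 K, excess uptake treated directly as the capacity estimate, and made-up example frameworks. It illustrates the trade-off stated above: beyond a point, higher surface area comes with lower crystal density, so volumetric capacity stops improving.

```python
from dataclasses import dataclass

# Assumed empirical coefficient (~2 wt% excess H2 per 1000 m^2/g at 77 K);
# illustrative only, not the paper's fitted relationship.
WT_PCT_PER_M2G = 2.0 / 1000.0

@dataclass
class Framework:
    name: str
    surface_area_m2g: float      # gravimetric surface area
    crystal_density_gcc: float   # framework density after solvent removal
    porous: bool                 # nontrivial internal porosity?

def screen(frameworks):
    """Rank porous candidates by approximate volumetric H2 density."""
    results = []
    for f in frameworks:
        if not f.porous:
            continue  # skip compounds without accessible pore volume
        gravimetric_wt_pct = WT_PCT_PER_M2G * f.surface_area_m2g
        # Rough volumetric density (g H2 / L) from gravimetric capacity
        # and the crystal density of the empty framework.
        volumetric_gL = gravimetric_wt_pct / 100.0 * f.crystal_density_gcc * 1000.0
        results.append((f.name, gravimetric_wt_pct, volumetric_gL))
    return sorted(results, key=lambda r: r[2], reverse=True)

if __name__ == "__main__":
    candidates = [
        Framework("hypothetical-A", 4500.0, 0.45, True),
        Framework("hypothetical-B", 2800.0, 0.90, True),
        Framework("hypothetical-C", 6500.0, 0.25, True),
        Framework("dense-phase",     300.0, 1.60, False),
    ]
    for name, wt, vol in screen(candidates):
        print(f"{name}: {wt:.1f} wt%  {vol:.1f} g/L")
```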
The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially-relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information regarding the data is only present in its enclosing web page and needs to be extracted appropriately. This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself; and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.
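To make the operator idea concrete, here is a minimal sketch in the spirit of the Context operator, not Octopus's actual algorithm: the table rows, the page text, and the regex used to recover the year are illustrative assumptions. The operator's job is to attach information that lives only on the enclosing page (here, a year) as a new column on every extracted row.

```python
import re

def context(table, page_text, extractors):
    """Best-effort Context-style operator: infer attribute values from the
    enclosing page and add them as columns to every row of the table."""
    inferred = {}
    for column, pattern in extractors:
        match = re.search(pattern, page_text)
        if match:
            inferred[column] = match.group(0)
    return [{**row, **inferred} for row in table]

if __name__ == "__main__":
    # A table of program-committee members whose year appears only in the
    # surrounding page, not in the table itself (hypothetical data).
    rows = [{"name": "A. Researcher"}, {"name": "B. Scholar"}]
    page = "VLDB 2009 program committee and organization."
    print(context(rows, page, [("year", r"\b(?:19|20)\d{2}\b")]))
```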
The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational database to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing systems do not apply them, substantially because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection in a MapReduce program is embedded in Java code along with lots of other program logic. We could ask the programmer to provide explicit hints about the program's data semantics, but one of MapReduce's attractions is precisely that it does not ask the user for such information. This paper covers Manimal, which automatically analyzes MapReduce programs and applies appropriate data-aware optimizations, thereby requiring no additional help at all from the programmer. We show that Manimal successfully detects optimization opportunities across a range of data operations, and that it yields speedups of up to 1,121% on previously-written MapReduce programs.
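The sketch below (in Python rather than Manimal's Java setting) illustrates why the selection is invisible to the framework and what a selection-aware rewrite buys: the filter is ordinary user code inside the map function, and only an analysis that recognizes it can answer the job from an index instead of a full scan. The record layout and the dictionary standing in for a B+Tree are assumptions for illustration.

```python
def map_fn(record):
    # The selection is buried in ordinary user code: only recent records matter.
    if record["year"] >= 2010:
        yield record["author"], 1

def full_scan(records):
    """What the unoptimized MapReduce job effectively does: visit every record."""
    for rec in records:
        yield from map_fn(rec)

def index_scan(index_by_year, min_year):
    """What a selection-aware rewrite could do: consult a sorted structure
    (a dict keyed by year stands in for a B+Tree) and touch only records
    that can satisfy the detected predicate."""
    for year in sorted(y for y in index_by_year if y >= min_year):
        for rec in index_by_year[year]:
            yield from map_fn(rec)

if __name__ == "__main__":
    data = [
        {"author": "chen", "year": 2008},
        {"author": "ortiz", "year": 2011},
        {"author": "okafor", "year": 2012},
    ]
    index = {}
    for rec in data:
        index.setdefault(rec["year"], []).append(rec)
    assert list(full_scan(data)) == list(index_scan(index, 2010))
    print(list(index_scan(index, 2010)))  # [('ortiz', 1), ('okafor', 1)]
```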