Ontology-based extraction and structuring of information from data-rich unstructured documents

Embley, David W.; Campbell, Douglas M.; Smith, Randy D.; Liddle, Stephen W.

doi:10.1145/288627.288641

Cited by 108 publications

(58 citation statements)

References 24 publications


“…In previous works we showed how to improve performance in document analysis and understanding by using semantic context models [WM00]. One of the first ideas for using domain ontologies in information extraction have been described by [ECSL98]. Information extraction as such has been implemented by regarding [AI99].…”

Section: Related Work

mentioning

confidence: 99%

Professional Knowledge Management

Althoff,

Dengel,

Bergmann

et al. 2005

Lecture Notes in Computer Science


publishes this series in order to make available to a broad public recent findings in informatics (i.e. computer science and information systems), to document conferences that are organized in cooperation with GI and to publish the annual GI Award dissertation. Broken down into the fields of Seminar, Proceedings, Dissertations, and Thematics, current topics are dealt with from the fields of research and development, teaching and further training in theory and practice. The Editorial Committee uses an intensive review process in order to ensure the high level of the contributions. The volumes are published in German or English. Information: http://www.gi-ev.de/service/publikationen/lni/

The WM2009 is the 5th conference of the bi-annual series "Professional Knowledge Management". A broad integrative overview of current trends and new insights in organisational, social and technical aspects of knowledge management is provided to participants from academia and practice. The conference programme is composed of 11 workshops, invited talks, tutorials, a poster and demo session and an exhibition. It also provides space to discuss problems and challenges, share experiences and network. This volume contains refereed contributions from the workshops.

145 ISSN 1617-5468 ISBN 978-3-88579-239-0

Preface

Knowledge has become more and more an imperative key success factor for each company. It plays a key role in many business processes, in product development and advanced customer satisfaction. This is particularly important as products are becoming more complex and processes more knowledge-intensive. Many companies are going to realize their knowledge potential and improve their knowledge management. New technologies like those known under the "Web 2.0" label and new methods like intellectual capital statements support the improvement of knowledge management.

The 5th conference on Professional Knowledge Management provides a broad integrative overview of organizational, cultural, social and technical aspects of knowledge management. The focus of the conference is bringing together different research disciplines and sharing experiences gained in the different areas where knowledge management is being applied. In particular, the 5th conference on Professional Knowledge Management consists of five tutorials and eleven workshops focusing on current trends. Topics of these tutorials and workshops include knowledge services and mashups, knowledge and social networks, convergence of knowledge management and e-learning, productive knowledge work, personal knowledge management, experience management, measuring and benchmarking the economic success of knowledge management, integrating knowledge base systems, as well as knowledge management approaches specialised for SMEs, for companies from the financial sector, or for combination with enterprise communication.

Ninety-five contributions were submitted to the workshops, of which 45 were accepted as long papers and 20 as short papers. These 65 contributions are collected in this proceedin...


“…The essence of this problem boils down to organizing the concepts and concept instances in a HTML document into a (labeled) semantic partition tree. There are a number of areas related to this problem, namely, XML schema discovery [15,26,14,27], schema inference from HTML documents [8,2], wrapper construction [17,7,25], record boundary detection in HTML documents [12,11,10,4], and semantic annotation of HTML documents [18,19,9]. However, our approach departs from all the related works above in several respects. Firstly, our main focus is on template-based content-rich HTML documents.…”

Section: Related Work

mentioning

confidence: 99%

Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis

Mukherjee

Yang

Ramakrishnan

2003

Lecture Notes in Computer Science


Abstract. Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically annotating HTML documents with semantic labels. Exploiting a key observation that semantically related items exhibit consistency in presentation style as well as spatial locality in template-based content-rich HTML documents, we have developed a novel framework for automatically partitioning such documents into semantic structures. Our framework tightly couples structural analysis of documents with semantic analysis incorporating domain ontologies and lexical databases such as WordNet. We present experimental evidence of the effectiveness of our techniques on a large collection of HTML documents from various news portals.


“…98-1 24.02 [Embley, et al, 1998] [Woods 2000] presents positive results for the creation of large-scale subsumption (i.e. abstraction) hierarchies from lexical and phrasal analysis of free text. By analyzing relationships among constituents of phrases and compound morphemes, lexical strings from text can be automatically placed at appropriate levels of generality within a hierarchy encoding subsumption relationships.…”

mentioning

confidence: 99%

Ontology-Based Information Extraction from Free-Form Text

Braun

2000


Report developed under SBIR contract. In this Phase I SBIR research we demonstrated the feasibility of an information extraction (IE) system that can leverage semantic representations to significantly increase end-to-end recall for the IE task while maintaining or improving precision. Our end-to-end Ontology-Based IE (OBIE) system combines machine learning techniques with a novel architecture built around a shared domain ontology. This architecture enables interaction between different levels of the IE processing stream simultaneously through the shared ontology. By incorporating hierarchical knowledge in their learning algorithms, IE modules can perform their extraction tasks with greater depth and accuracy. Bootstrapping algorithms were extended to automatically learn the ontology of a new domain, to assist in training the IE components, and to reduce the burden of annotation on the end-user. Broad-coverage and rare-case extraction rules were augmented by classifiers induced from the trained ontology to shore up the precision typically lost by such rules. Performance metrics allow a preliminary characterization of recall and precision gains enabled by the proposed architecture. Our Phase I research and development of a proof-of-concept prototype demonstrated the feasibility and utility of OBIE's ontology-based IE capability and provides a solid foundation for our Phase II implementation.
