Peerasak INTARAPAIBOON†a), Student Member, Ekawit NANTAJEEWARAWAT†b), and Thanaruk THEERAMUNKONG†c), Members

SUMMARY Based on sliding-window rule application and extraction filtering, we present a framework for extracting multi-slot frames describing chemical reactions from Thai free text with unknown target-phrase boundaries. A supervised rule learning algorithm is employed for automatic construction of pattern-based extraction rules from hand-tagged training phrases. A filtering method is devised for removal of incorrect extraction results based on features observed from text portions appearing between adjacent slot fillers in source documents. Extracted reaction frames are represented as concept expressions in description logics and are used as metadata for document indexing. A document knowledge base supporting semantics-based information retrieval is constructed by integrating document metadata with domain-specific ontologies.

key words: information extraction, semantics-based information retrieval, ontology, description logics, automated reasoning

Introduction

In traditional keyword-based information retrieval systems, retrieval results are determined solely by appearance of query keywords in documents or in document indexes. In domain-specific applications, however, it is often desirable to describe an information need more precisely by specifying required relations between domain concepts. A user in the chemistry domain, for example, may wish to search for a document concerning "a chemical reaction that produces a compound containing a carbon atom." With the background knowledge that "propionaldehyde has some carbon atom as its component," the same user may furthermore expect the retrieval results to include a document containing a statement such as "propionaldehyde is obtained from the oxidation reaction of 1-propanol," which looks very different syntactically from the search condition specified above.
It is anticipated that information extraction (IE) technology and the recent development of machine-processable ontology languages, such as OWL [1], will contribute significantly to the realization of such semantics-based information retrieval.

In this paper, we present a framework for extracting multi-slot frames describing chemical reactions from chemistry thesis abstracts written in Thai. From input thesis abstracts, partially annotated with entity classes in a prepro- , is used as the core algorithm for constructing extraction rules. Pattern-based IE rules cannot automatically segment input documents so that they are applied only to relevant text portions. When applied to free text, a rule is usually applied to each sentence, one at a time. Identifying the boundary of a Thai sentence is, however, problematic. In Thai, there is no explicit end-of-sentence punctuation [4] and the notion of a sentence is unclear [2]. To apply IE rules without predetermining the boundaries of sentences and potential target phrases, rule application using sliding windows (RAW) is introduced. Using sliding...
Peerasak INTARAPAIBOON†a), Student Member, Ekawit NANTAJEEWARAWAT†b), and Thanaruk THEERAMUNKONG†c), Members

SUMMARY Due to the limitations of language-processing tools for the Thai language, pattern-based information extraction from Thai documents requires supplementary techniques. Based on sliding-window rule application and extraction filtering, we present a framework for extracting semantic information from medical-symptom phrases with unknown boundaries in Thai unstructured-text information entries. A supervised rule learning algorithm is employed for automatic construction of information extraction rules from hand-tagged training symptom phrases. Two filtering components are introduced: one uses a classification model to predict rule application across a symptom-phrase boundary based on instantiation features of rule internal wildcards, the other uses weighted classification confidence to resolve conflicts arising from overlapping extractions. In our experimental study, we focus our attention on two basic types of symptom phrasal descriptions: one is concerned with abnormal characteristics of some observable entities and the other with human-body locations at which primitive symptoms appear. The experimental results show that the filtering components improve precision while preserving recall satisfactorily.
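The idea of applying extraction rules through sliding windows and then resolving overlapping extractions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rules are plain regular expressions with named groups standing in for learned pattern-based rules, the tokens are English placeholders rather than segmented Thai text, and the fixed per-rule confidence values stand in for the weighted classification confidence mentioned in the summary. It is not the authors' algorithm.

```python
import re
from typing import NamedTuple


class Extraction(NamedTuple):
    start: int         # index of the first token covered
    end: int           # index one past the last token covered
    slots: tuple       # sorted (slot name, filler) pairs
    confidence: float  # hypothetical per-rule score, not from the paper


def apply_rules_with_windows(tokens, rules, window_size):
    """Apply every (pattern, confidence) rule to each window of tokens.

    Assumes tokens contain no spaces, so spaces mark token boundaries.
    """
    found = set()
    last_start = max(0, len(tokens) - window_size)
    for start in range(last_start + 1):
        text = " ".join(tokens[start:start + window_size])
        for pattern, confidence in rules:
            for m in pattern.finditer(text):
                # Convert character offsets to absolute token indices.
                first = start + text[:m.start()].count(" ")
                end = start + text[:m.end()].count(" ") + 1
                slots = tuple(sorted(m.groupdict().items()))
                # Overlapping windows rediscover the same match;
                # the set keeps a single copy.
                found.add(Extraction(first, end, slots, confidence))
    return sorted(found)


def resolve_overlaps(extractions):
    """Greedily keep the most confident extractions that do not overlap."""
    kept = []
    for ex in sorted(extractions, key=lambda e: -e.confidence):
        if all(ex.end <= k.start or ex.start >= k.end for k in kept):
            kept.append(ex)
    return sorted(kept, key=lambda e: e.start)


# Hypothetical rules: a specific symptom/location rule and a looser one.
rules = [
    (re.compile(r"\b(?P<symptom>rash|pain) (?:on|in) (?P<location>\w+ \w+)\b"), 0.9),
    (re.compile(r"\b(?P<symptom>\w+) on (?P<location>\w+)\b"), 0.5),
]
tokens = "patient has red rash on left arm and mild pain in lower back".split()

candidates = apply_rules_with_windows(tokens, rules, window_size=5)
kept = resolve_overlaps(candidates)
for ex in kept:
    print(ex.start, ex.end, dict(ex.slots))
```

No sentence boundaries are needed here: each rule sees only a fixed-size window, and the looser rule's extraction is discarded because it overlaps a more confident one. A real filter would score each extraction individually rather than reuse a constant per rule.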
An intuitionistic fuzzy set (IFS) is an extended version of a fuzzy set and is capable of representing hesitancy degrees. A framework for text classification is presented. Two main challenges are addressed: how to represent documents in terms of IFSs and how to obtain a pattern of each class from such an IFS-based representation. By using some existing similarity measures for IFSs, the proposed framework is applied to two benchmark datasets for text classification. The proposed framework yields satisfactory results when compared to decision tree, k-NN, naïve Bayes, and support vector machine classifiers.
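As a rough illustration of similarity-based classification over IFSs, the Python sketch below represents a document and each class pattern as a list of (membership, non-membership) pairs, one pair per term, and assigns the class whose pattern is most similar. The representation, the example values, and the choice of similarity (one minus the normalized Hamming distance that includes the hesitancy term) are assumptions for illustration; the abstract does not state which measures or representation the framework uses.

```python
def hesitancy(mu, nu):
    """Hesitancy degree of an IFS element: what membership and
    non-membership leave unexplained."""
    return 1.0 - mu - nu


def ifs_similarity(a, b):
    """Similarity of two IFSs given as equal-length lists of (mu, nu) pairs.

    Computed as 1 minus the normalized Hamming distance with three terms
    (membership, non-membership, hesitancy). This is an assumed choice.
    """
    if len(a) != len(b) or not a:
        raise ValueError("IFSs must be non-empty and of equal length")
    total = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(a, b):
        total += (
            abs(mu_a - mu_b)
            + abs(nu_a - nu_b)
            + abs(hesitancy(mu_a, nu_a) - hesitancy(mu_b, nu_b))
        )
    return 1.0 - total / (2 * len(a))


def classify(doc, class_patterns):
    """Return the label of the class pattern most similar to the document."""
    return max(class_patterns, key=lambda label: ifs_similarity(doc, class_patterns[label]))


# Hypothetical two-term vocabulary and class patterns.
class_patterns = {
    "sports": [(0.8, 0.1), (0.1, 0.7)],
    "finance": [(0.1, 0.8), (0.7, 0.1)],
}
doc = [(0.6, 0.2), (0.2, 0.5)]

label = classify(doc, class_patterns)
print(label, ifs_similarity(doc, class_patterns[label]))
```

Swapping in a different IFS similarity measure only requires replacing `ifs_similarity`; the nearest-pattern decision in `classify` stays the same.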