Abstract. The widespread application of document digitization has led to the use of structured digital representations such as XML. Methods for extracting formatting metadata can be applied to such structured documents to enhance their accessibility, for example through augmented audio representations of documents. To the best of our knowledge, no effort has yet been made to build a system that automatically extracts the semantic information of document formatting solely from document layout, without the use of natural language processing. In this study, a corpus of XML representations of several issues of a Greek newspaper is used to create and evaluate a semantic classifier of text formatting based on Bayesian networks.
Part 5: Languages and Ontologies
Existing information extraction approaches are analyzed and categorized into several groups based on the superiority and intelligence of the approaches as well as their capability to solve complex problems. Two practical approaches are provided to clarify how information extraction solutions can be used to obtain valuable information from numerous reviews. The first approach supports the front-end services in the EASY-IMP project. Customer preference and the optimum interest of customers are determined using a TF-IDF approach. Roughly 100,000 pages have been analyzed, and customer preference is studied based on the most relevant keywords. However, the TF-IDF approach is limited in its ability to provide personalized information, since it can only obtain restricted information based on weight calculation. In order to extract customized information more efficiently, an opinion mining algorithm is proposed. The proposed algorithm aims to obtain sufficient information extraction results and to reduce the complexity and running time of information extraction by jointly discovering the main opinion mining elements. The analyzed reviews show that the proposed algorithm can effectively and simultaneously identify the main elements
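As an illustration of the keyword weighting mentioned above, the following is a minimal sketch of TF-IDF over tokenized reviews. It is not the EASY-IMP implementation: the function names `tf_idf` and `top_keywords`, the toy reviews, and the choice of raw term frequency with an unsmoothed logarithmic inverse document frequency are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (one common variant, not taken from the source):
    tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy reviews, already tokenized.
reviews = [
    "great battery great screen".split(),
    "poor battery".split(),
    "great screen poor sound".split(),
]
scores = tf_idf(reviews)
for review, weights in zip(reviews, scores):
    print(" ".join(review), "->", top_keywords(weights, k=2))
```

A term that occurs in every document receives weight zero under this variant, which is why the ranking reflects corpus-wide statistics rather than any individual customer's opinion, in line with the limitation noted above.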