Abstract. The availability in machine-readable form of descriptions of the structure of documents, as well as of the document discourse (e.g. the scientific discourse within scholarly articles), is crucial for facilitating semantic publishing and the overall comprehension of documents by both users and machines. In this paper we introduce DoCO, the Document Components Ontology, an OWL 2 DL ontology that provides a general-purpose structured vocabulary of document elements to describe both structural and rhetorical document components in RDF. In addition to giving a formal description of the ontology, this paper showcases its utility in practice in a variety of our own applications and other activities of the Semantic Publishing community that rely on DoCO to annotate and retrieve document components of scholarly articles.
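As a rough illustration of describing document components in RDF, the following minimal sketch emits N-Triples that type one section and its paragraphs and link them by containment. It is not taken from the paper: the namespace IRIs, the class names `Section` and `Paragraph`, the containment property, and the `example.org` document IRI are all assumptions that should be checked against the published ontology.

```python
# Assumed IRIs: verify against the published ontology before relying on them.
DOCO = "http://purl.org/spar/doco/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PO_CONTAINS = "http://www.essepuntato.it/2008/12/pattern#contains"
BASE = "http://example.org/article/1/"  # hypothetical document IRI


def triple(s, p, o):
    """Serialize one triple of IRIs as an N-Triples line."""
    return f"<{s}> <{p}> <{o}> ."


def describe_section(section_id, paragraph_ids):
    """Type a section and its paragraphs, and link them by containment."""
    section = BASE + section_id
    lines = [triple(section, RDF_TYPE, DOCO + "Section")]
    for pid in paragraph_ids:
        para = BASE + pid
        lines.append(triple(para, RDF_TYPE, DOCO + "Paragraph"))
        lines.append(triple(section, PO_CONTAINS, para))
    return lines


ntriples = describe_section("sec1", ["p1", "p2"])
print("\n".join(ntriples))
```

Plain strings are used here only to keep the sketch dependency-free; an RDF library would normally handle serialization and validation.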
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
PDFX is a rule-based system designed to reconstruct the logical structure of scholarly articles in PDF form, regardless of their formatting style. The system's output is an XML document that describes the input article's logical structure in terms of title, sections, tables, references, etc. and also links it to geometrical typesetting markers in the original PDF, such as paragraph and column breaks. The key aspect of the presented approach is that the rule set used relies on relative parameters derived from font and layout specifics of each article, rather than on a template-matching paradigm. The system thus obviates the need for domain- or layout-specific tuning or prior training, exploiting only typographical conventions inherent in scientific literature. Evaluated against a significantly varied corpus of articles from nearly 2000 different journals, PDFX gives a 77.45 F1 measure for top-level heading identification and 74.03 for extracting individual bibliographic items. The service is freely available for use at
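The idea of relative parameters can be sketched in a few lines. The example below is not PDFX's rule set; it is a hypothetical toy that infers each article's body font size from the article itself and flags lines that are larger by a fixed ratio. The function names, the `(text, font_size)` input format, and the `ratio` threshold of 1.15 are all assumptions made for illustration.

```python
from collections import Counter


def body_font_size(lines):
    """Return the font size that carries the most characters in the article."""
    weight = Counter()
    for text, size in lines:
        weight[size] += len(text)
    return weight.most_common(1)[0][0]


def find_headings(lines, ratio=1.15):
    """Flag lines whose font size exceeds the article's own body size by `ratio`."""
    body = body_font_size(lines)
    return [text for text, size in lines if size >= body * ratio]


# Two hypothetical layouts with different absolute font sizes.
article_a = [
    ("1 Introduction", 12.0),
    ("Body text of the first paragraph.", 10.0),
    ("More body text here.", 10.0),
    ("2 Methods", 12.0),
]
article_b = [
    ("Abstract", 9.0),
    ("Small-print body text in a different layout.", 7.5),
    ("Results", 9.0),
]

print(find_headings(article_a))
print(find_headings(article_b))
```

Because the threshold is expressed relative to each article's dominant font size, the same rule applies to both layouts without a per-journal template.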