The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than, and as accurate as, existing tools. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
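To make the idea of dictionary-based named entity recognition concrete, the following is a minimal sketch of longest-match tagging against a name dictionary. It is not the SPECIES implementation, whose algorithm the abstract does not describe; the function names (`tokenize`, `build_index`, `tag`), the toy dictionary, and the choice of word-level tokenization are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

Token = Tuple[str, int, int]          # (lowercased word, start offset, end offset)
Match = Tuple[int, int, str, int]     # (start, end, matched text, taxon identifier)


def tokenize(text: str) -> List[Token]:
    """Split text into lowercased word tokens, keeping character offsets."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def build_index(names: Dict[str, int]) -> Dict[Tuple[str, ...], int]:
    """Map each dictionary name, as a tuple of word tokens, to its identifier."""
    return {tuple(tok for tok, _, _ in tokenize(name)): taxon_id
            for name, taxon_id in names.items()}


def tag(text: str, index: Dict[Tuple[str, ...], int]) -> List[Match]:
    """Scan left to right, preferring the longest dictionary name at each position."""
    tokens = tokenize(text)
    max_len = max((len(key) for key in index), default=0)
    matches: List[Match] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in index:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                matches.append((start, end, text[start:end], index[key]))
                i += n  # skip past the matched name
                break
        else:
            i += 1  # no name starts at this token
    return matches


# Toy dictionary (illustrative only): name or synonym -> taxon identifier.
toy_names = {"Homo sapiens": 9606, "Escherichia coli": 562, "E. coli": 562}
index = build_index(toy_names)
hits = tag("Escherichia coli and Homo sapiens were compared; E. coli grew faster.", index)
for start, end, name, taxon_id in hits:
    print(start, end, name, taxon_id)
```

Because lookups are hash-table probes on token tuples, the cost per token is bounded by the longest name in the dictionary, which is one plausible reason dictionary-based taggers can be fast. A real tagger would also need to handle abbreviations, ambiguity, and case sensitivity, which this sketch ignores.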
The study of ecosystem functioning, the role which organisms play in an ecosystem, is becoming increasingly important in marine ecological research. The functional structure of a community can be represented by a set of functional traits assigned to behavioural, reproductive and morphological characteristics. The collection of these traits from the literature is, however, a laborious and time-consuming process, and gaps in knowledge and restricted availability of literature are a common problem. Trait data are not yet readily shared by research communities, and even when they are, a lack of trait data repositories and of standards for data formats leads to the publication of trait information in forms which cannot be processed by computers. This paper describes Polytraits (http://polytraits.lifewatchgreece.eu), a database on biological traits of marine polychaetes (bristle worms, Polychaeta: Annelida). At present, the database contains almost 20,000 records on morphological, behavioural and reproductive characteristics of more than 1,000 marine polychaete species, all referenced by literature sources. All data can be freely accessed through the project website in different ways and formats, both human-readable and machine-readable, and have been submitted to the Encyclopedia of Life for archival and integration with trait information from other sources.
Summary: The association of organisms to their environments is a key issue in exploring biodiversity patterns. This knowledge has traditionally been scattered, but textual descriptions of taxa and their habitats are now being consolidated in centralized resources. However, structured annotations are needed to facilitate large-scale analyses. Therefore, we developed ENVIRONMENTS, a fast dictionary-based tagger capable of identifying Environment Ontology (ENVO) terms in text. We evaluate the accuracy of the tagger on a new manually curated corpus of 600 Encyclopedia of Life (EOL) species pages. We use the tagger to associate taxa with environments by tagging EOL text content monthly, and integrate the results into the EOL to disseminate them to a broad audience of users.
Availability and implementation: The software and the corpus are available under the open-source BSD and the CC-BY-NC-SA 3.0 licenses, respectively, at http://environments.hcmr.gr
Contact: pafilis@hcmr.gr or lars.juhl.jensen@cpr.ku.dk
Supplementary information: Supplementary data are available at Bioinformatics online.