A Nanopublishing Architecture for Biomedical Data

Sernadela, Pedro; Horst, Eelke van der; Thompson, Mark; Lopes, Pedro; Roos, Marco; Oliveira, José Luís

doi:10.1007/978-3-319-07581-5_33

Cited by 6 publications

(6 citation statements)

References 10 publications

“…In addition, we plan to include additional resources such as WordNet for query expansion to increase the chance of finding correct matches from ontologies or coding systems. Finally, we also want to publish mappings as linked data, for example as nanopublications ( 26 ) ( http://nanopub.org ), so they can be easily reused. SORTA is available as a service running at http://molgenis.org/sorta .…”

Section: Discussion (mentioning)

confidence: 99%

SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data

et al. 2015

There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use the same data collection protocols. As a result, retrospective standardization is often required, which involves matching of original (unstructured or locally coded) data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Disease) and HPO (Human Phenotype Ontology). This data curation process is usually a time-consuming process performed by a human expert. To help mechanize this process, we have developed SORTA, a computer-aided system for rapidly encoding free text or locally coded values to a formal coding system or ontology. SORTA matches original data values (uploaded in semicolon delimited format) to a target coding system (uploaded in Excel spreadsheet, OWL ontology web language or OBO open biomedical ontologies format). It then semi-automatically shortlists candidate codes for each data value using Lucene and n-gram based matching algorithms, and can also learn from matches chosen by human experts. We evaluated SORTA’s applicability in two use cases. For the LifeLines biobank, we used SORTA to recode 90 000 free text values (including 5211 unique values) about physical exercise to MET (Metabolic Equivalent of Task) codes. For the CINEAS clinical symptom coding system, we used SORTA to map to HPO, enriching HPO when necessary (315 terms matched so far). Out of the shortlists at rank 1, we found a precision/recall of 0.97/0.98 in LifeLines and of 0.58/0.45 in CINEAS. More importantly, users found the tool both a major time saver and a quality improvement because SORTA reduced the chances of human mistakes.
Thus, SORTA can dramatically ease data (re)coding tasks and we believe it will prove useful for many more projects. Database URL: http://molgenis.org/sorta or as an open source download from http://www.molgenis.org/wiki/SORTA
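
The abstract mentions n-gram based matching for shortlisting candidate codes. The following is only a minimal sketch of that general idea, not SORTA's actual implementation: it assumes character bigrams and a Dice coefficient, and the function names (`ngrams`, `dice_similarity`, `shortlist`) and the sample labels are hypothetical.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the character n-grams of a lowercased string padded with ^ and $."""
    padded = f"^{text.lower().strip()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over n-gram multisets, between 0.0 and 1.0."""
    grams_a, grams_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total


def shortlist(value, candidates, top_k=3):
    """Rank candidate labels by similarity to a free-text value (assumed scoring)."""
    scored = [(label, dice_similarity(value, label)) for label in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical target labels; a misspelled input should still rank its match first.
    for label, score in shortlist("runing", ["running", "swimming", "cycling"]):
        print(f"{label}\t{score:.2f}")
```

In a setting like the one the abstract describes, a human expert would then pick the correct code from such a shortlist.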

“…Nanopublications seem to be a particularly apt choice for structuring OpenAnn text annotations in the biomedical domain. Using nanopublications, the assertion, provenance, and metadata for a PubAnnotation are clearly demarcated into named graphs, which can retrieved, validated, and viewed by a growing set of data publication tools [ 9 ].…”

Section: Discussion (mentioning)

confidence: 99%

“…This generalization of the model is particularly pertinent to collaborative annotation scenarios; exposing linguistic annotations in the de facto language of the Semantic Web, the W3C's Resource Description Framework (RDF), provides several advantages that we have previously described [ 6 ]. We further demonstrate that the model can be integrated with the nanopublications model [ 7 , 8 ], facilitating their use in a growing set of data publication tools [ 9 ].…”

Section: Introduction (mentioning)

confidence: 99%

Interoperability of text corpus annotations with the semantic web

2015
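
The statements above describe a nanopublication as an assertion, its provenance, and its metadata, each held in a separate named graph. A rough sketch of that layout follows, assuming plain tuples in place of an RDF library. The `Nanopub` class, the `#Head`-style graph names, and the `http://example.org/` URIs are hypothetical; the `np:` terms come from the nanopublication schema.

```python
from dataclasses import dataclass, field

NP = "http://www.nanopub.org/nschema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Nanopub:
    """Three lists of (subject, predicate, object) triples, one per named graph."""
    uri: str
    assertion: list = field(default_factory=list)
    provenance: list = field(default_factory=list)
    pubinfo: list = field(default_factory=list)

    def graph_uri(self, name):
        # Assumed naming convention: graph URIs are fragments of the nanopub URI.
        return f"{self.uri}#{name}"

    def quads(self):
        """Return (s, p, o, graph) quads, including a head graph linking the parts."""
        head = self.graph_uri("Head")
        out = [
            (self.uri, RDF_TYPE, NP + "Nanopublication", head),
            (self.uri, NP + "hasAssertion", self.graph_uri("assertion"), head),
            (self.uri, NP + "hasProvenance", self.graph_uri("provenance"), head),
            (self.uri, NP + "hasPublicationInfo", self.graph_uri("pubinfo"), head),
        ]
        for name in ("assertion", "provenance", "pubinfo"):
            graph = self.graph_uri(name)
            out.extend((s, p, o, graph) for s, p, o in getattr(self, name))
        return out


if __name__ == "__main__":
    ex = "http://example.org/"  # hypothetical namespace for illustration only
    pub = Nanopub(
        uri=ex + "np1",
        assertion=[(ex + "annotation1", ex + "denotes", ex + "conceptA")],
        provenance=[(ex + "np1#assertion", ex + "derivedFrom", ex + "doc1")],
        pubinfo=[(ex + "np1", ex + "createdBy", ex + "curator1")],
    )
    for quad in pub.quads():
        print(quad)
```

Keeping the three parts in separate graphs is what lets tools retrieve or validate the assertion independently of who made it and when.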

“…The Web can offer promising solutions for externalization. One model for publishing semantically rich scientific data is the nanopublication [16]. Nanopublications are serialized in RDF, which allows them to be used in accordance with linked data principles.…”

Section: Problems of existing methods (unclassified)