The CALBC initiative aims to provide a large-scale biomedical text corpus that contains semantic annotations for named entities of different kinds. The generation of this corpus requires that the annotations from different automatic annotation systems be harmonized. In the first phase, the annotation systems from five participants (EMBL-EBI, EMC Rotterdam, NLM, JULIE Lab Jena, and Linguamatics) were gathered. All annotations were delivered in a common annotation format that included concept identifiers in the boundary assignments and that enabled comparison and alignment of the results. During the harmonization phase, the results produced from those different systems were integrated in a single harmonized corpus ("silver standard" corpus) by applying a voting scheme. We give an overview of the processed data and the principles of harmonization--formal boundary reconciliation and semantic matching of named entities. Finally, all submissions of the participants were evaluated against that silver standard corpus. We found that species and disease annotations are better standardized amongst the partners than the annotations of genes and proteins. The raw corpus is now available for additional named entity annotations. Parts of it will be made available later on for a public challenge. We expect that we can improve corpus building activities both in terms of the numbers of named entity classes being covered, as well as the size of the corpus in terms of annotated documents.
BackgroundCompetitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly and the final corpus consists at the most of a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions, the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge asking the participants to annotate the corpus with their text processing solutions.ResultsAll four PPs from the CALBC project and in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE in comparison to the identification of entities categorized as DISO and SPE. The best performance over all semantic groups were achieved from two annotation solutions that have been trained on the SSC-I.The data sets from participants were used to generate the harmonised Silver Standard Corpus II (SSC-II), if the participant did not make use of the annotated data set from the SSC-I for training purposes. The performances of the participants’ solutions were again measured against the SSC-II. The performances of the annotation solutions showed again better results for DISO and SPE in comparison to CHED and PRGE.ConclusionsThe SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs’ annotation solutions in comparison to the SSC-I.
Identifying relationships between hitherto unrelated entities in different ontologies is the key task of ontology alignment. An alignment is either manually created by domain experts or automatically by an alignment system. In recent years, several alignment systems have been made available, each using its own set of methods for relation detection. To evaluate and compare these systems, typically a manually created alignment is used, the so-called reference alignment. Based on our experience with several of these reference alignments we derived requirements and translated them into simple quality checks to ensure the alignments’ validity and also their reusability. In this article, these quality checks are applied to a standard reference alignment in the biomedical domain, the Ontology Alignment Evaluation Initiative Anatomy track reference alignment, and two more recent data sets covering multiple domains, including but not restricted to anatomy and biology.
The importance of livestock-associated methicillin-resistant Staphylococcus aureus (LA-MRSA) as an infectious agent for humans has increased in recent years in Germany. Although it is well known that the prevalence of MRSA in pig farms is high, risk factors for the presence of MRSA in herds of fattening pigs are still poorly understood. The aim of this study was to evaluate available data from previous studies on MRSA in fattening pigs in a meta-analysis to answer the question: What are the factors associated with the occurrence of MRSA in fattening pig herds? The studies on MRSA in pigs that were identified by literature research were heterogeneous with respect to the risk factors investigated and the type of herds focused on. Therefore we decided to carry out a pooling analysis on herd level rather than a typical meta-analysis. Eligible herd data were identified based on the published literature and communication with the authors. The final data set covered 400 fattening pig herds from 10 different studies and 12 risk factors. The prevalence of MRSA in the 400 fattening pig herds was 53.5%. Data were analyzed using generalized estimating equations (GEE). The resulting multivariate model confirmed previously identified risk factors for MRSA in pig herds (herd size and herd type). It also identified further risk factors: group treatment of fattening pigs with antimicrobial drugs (OR=1.79) and housing fattening pig herds on at least partially slatted floors (OR=2.39) compared to plain floor. In contrast, according to the model, fattening pig herds on farms keeping other livestock along with pigs were less likely to harbor MRSA (OR=0.54). The results underline the benefits from a pooling analysis and cooperative re-evaluation of published data.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.