FALDO: A semantic standard for describing the location of nucleotide and protein feature annotation

Bolleman, Jerven; Mungall, Christopher J.; Strozzi, Francesco; Baran, Joachim; Dumontier, Michel; Bonnal, Raoul J. P.; Buels, Robert; Hoehndorf, Robert; Fujisawa, Takatomo; Katayama, Toshiaki; Cock, Peter

doi:10.1101/002121

Cited by 9 publications

(12 citation statements)

References 28 publications


“…To such as FALDO [Bolleman et al, 2016], which is also used in the RDF of Ensembl [Zerbino et al, 2018] and Ensembl Genomes [Kersey et al, 2018], to describe sequence positions, and the EDAM ontology [Ison et al, 2013] to describe sequence/signature matches.…”

Section: Methods (mentioning)

confidence: 99%

HAMAP rules as SPARQL A portable annotation pipeline for genomes and proteomes

Bolleman, Castro, Baratin, et al. 2019

Preprint

Self Cite


Motivation: Genome and proteome annotation pipelines are generally custom built and therefore not easily reusable by other groups, which leads to duplication of effort, increased costs, and suboptimal results. One cost-effective way to increase the data quality in public databases is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation.

Results: We have translated the rules of our HAMAP proteome annotation pipeline to queries in the W3C standard SPARQL 1.1 syntax and applied them with two off-the-shelf SPARQL engines to UniProtKB/Swiss-Prot protein sequences described in RDF format. This approach is applicable to any genome or proteome annotation pipeline and greatly simplifies their reuse.

Availability: HAMAP SPARQL rules and documentation are freely available for download from the HAMAP FTP site ftp://ftp.expasy.org/databases/hamap/hamap sparql.tar.gz under a CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 license.

Contact: hamap@sib.swiss

Supplementary information: Supplementary data are included at the end of this document.



“…In general, due to the large variety of genomic annotations possible, it was decided that in the first iteration of a genomic RDF model, opaque Universally Unique IDentifiers (UUIDs) are to be used to represent sequence features. Each UUID would then be typed with its appropriate ontology, such as Sequence Ontology (SO), and sequence location would be specified using Feature Annotation Location Description Ontology (FALDO) [12,13]. FALDO was newly developed at the BioHackathon 2012 by representatives of UniProt [14], DDBJ [15] and genome scientists for the purpose of generically locating regions on the biological sequences (e.g., modification sites on a protein sequence, fuzzy promoter locations on a DNA sequence etc.).…”

Section: Review (mentioning)

confidence: 99%

BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains

et al. 2014

Self Cite


The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.


“…Furthermore, it has limited support for storing based-on provenance except for some experimental codes. FALDO's (13) only purpose is to unambiguously store genetic locations on a sequence. The Synthetic Biology Open Language (SBOL) (14) was successfully designed to describe complete synthetic constructs and the interactions between each of the elements.…”

Section: Introduction (mentioning)

confidence: 99%

Interoperable genome annotation with GBOL, an extendable infrastructure for functional data mining

Dam, Koehorst, Vik, et al. 2017

Preprint


Background: A standard structured format is used by the public sequence databases to present genome annotations. A prerequisite for a direct functional comparison is consistent annotation of the genetic elements with evidence statements. However, the current format provides limited support for data mining, hampering comparative analyses at large scale.

Results: The provenance of a genome annotation describes the contextual details and derivation history of the process that resulted in the annotation. To enable interoperability of genome annotations, we have developed the Genome Biology Ontology Language (GBOL) and associated infrastructure (GBOL stack). GBOL is provenance aware and thus provides a consistent representation of functional genome annotations linked to the provenance. GBOL is modular in design, extendible and linked to existing ontologies. The GBOL stack of supporting tools enforces consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. Modules have been developed to serialize the linked data (RDF) and to generate plain text format files.

Conclusion: The main rationale for applying formalized information models is to improve the exchange of information. GBOL uses and extends current ontologies to provide a formal representation of genomic entities, along with their properties and relations. The deliberate integration of data provenance in the ontology enables review of automatically obtained genome annotations at a large scale. The GBOL stack facilitates consistent usage of the ontology.

RDF | Genome | Annotation

Correspondence: maria.suarezdiez@wur.nl

Background: Advances in sequencing technologies have turned genomics into a data-rich scientific discipline in which the total assembled and subsequently annotated sequence data doubles every 30 months (1). To support the growth in data throughput, automated annotation algorithms have become an indispensable supplement to manual annotation (2, 3) and currently, automatic annotations in the UniProt database outnumber manual annotations 100-fold (4).

Functional genome comparison has been used to identify diagnostic markers, to develop effective treatments, and to understand genotype-phenotype associations (5-7). The volume and heterogeneity of genome annotation data has created a unique type of big data challenge, namely how to transform computational predicted annotations into actionable knowledge. Tapping into these available resources is only efficiently done by computational means and requires a consistent interlinking of data so that data becomes Findable, Accessible, Interoperable and Reusable (FAIR) (8).

The format for sharing of public genome sequence annotation data has been developed and is maintained by the International Nucleotide Sequence Database Collaboration (INSDC), a long-standing foundational initiative that operates between the DDBJ, EMBL-EBI and NCBI public repositories. However, tradeoffs between simplicity, human readability and representational p...


FALDO: A semantic standard for describing the location of nucleotide and protein feature annotation

Abstract: Background Nucleotide and protein sequence feature annotations are essential to understand biology on the
