2013
DOI: 10.1371/journal.pone.0075541

Can Inferred Provenance and Its Visualisation Be Used to Detect Erroneous Annotation? A Case Study Using UniProtKB

Abstract: A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which often contains the richest source of knowledge. Many databases reuse existing knowledge; during the curation process annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated from. …

Citations: cited by 11 publications (12 citation statements)
References: 32 publications
“…While this form of reuse is part of the folk-history of bioinformatics ( ), and is apparent from even a short perusal of a few bioinformatics databases, it has rarely been explicitly studied. In a previous study ( Bell et al , 2013 ), we have shown that the level of reuse in UniProtKB is extremely high—the most reused sentence in TrEMBL occurs more than seven million times, while the most common sentence in Swiss-Prot occurs more than 91 000 times. Moreover, we have shown that this reuse operates as an informal indicator of provenance; two identical sentences are likely to share a common history.…”
Section: Introduction
Citation type: mentioning
confidence: 86%
“…sequencing and annotation errors propagated by reuse and not eliminated by additional published sequences that would show it to be statistically insignificant). For annotations, it has been shown that it is possible to detect low-quality entries, resulting from this denormalisation, by looking for specific patterns of provenance in the database (35). With respect to gene models, this problem could be addressed in the future through the integration of RNA-Seq datasets in the annotation of new genome sequences.…”
Section: Denormalisation
Citation type: mentioning
confidence: 99%
“…It may be comforting to assume that the converse relationship may hold more universally: perhaps paralogous genes in the same species always have different functions. However, duplicate genes with equivalent functions can be retained to supply specific gene or protein dosage, providing a scenario in which paralogous genes have equivalent functions [27, 83, 84, 85, 86]. In multicellular eukaryotes, this can occur when gene regulatory functions are partitioned between daughter genes (i.e., the same protein is expressed by different genes in different tissues, cell types, or different periods of time/development).…”
Section: Protein Function and Evolution
Citation type: mentioning
confidence: 99%
“…Early pioneers in function annotation were quick to identify potential problems with large-scale annotation efforts [26, 27, 28], and misannotation is a growing concern among the general research community, as misannotated genes can have a “ripple effect” impacting diverse areas of biological inquiry [29, 30, 31]. By different measures, 10%–25% of functional calls are wrong, even in very small bacterial genomes [32].…”
Section: Introduction
Citation type: mentioning
confidence: 99%