2016
DOI: 10.1101/085324
Preprint

Benchmarks for Measurement of Duplicate Detection Methods in Nucleotide Databases

Abstract: Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to …


Cited by 3 publications (2 citation statements)
References 40 publications
“…We assumed that we were provided with an appropriate dictionary for biomedical embeddings. We evaluated our model on a large benchmark dataset [10] consisting of the 21 most heavily studied organisms in molecular biology. Our method was able to beat ML models with hand-crafted approaches in 11 of these cases, while it was within an F1-score of ±5 for the remaining.…”
Section: Comparison With Existing Methods
Confidence: 99%
“…To our knowledge, our collection is the largest set of duplicate records merged in INSDC considered to date. Note that we have collected even larger datasets based on other strategies, including expert and automatic curation (51). We focus on this collection here, to analyse how submitters understand duplicates as one perspective.…”
Section: Characteristics Of The Duplicate Collection
Confidence: 99%