2017
DOI: 10.1093/database/baw163

Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study

Abstract: GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither th…

citations
Cited by 45 publications
(25 citation statements)
references
References 64 publications
“…However, such reference databases are often missing for local faunas (Juhel et al, 2020) and their completeness varies for different regions. Thus, the impact of the taxonomic gaps in the reference database on the final taxonomic assignment and on the ecological inferences derived from eDNA data (Chen et al, 2017), and the extent of improvement gained with the development and use of a local database, are relevant aspects needing evaluation in eDNA studies.…”
mentioning
confidence: 99%
“…This could be a result of malware polymorphism technique which is used by attackers where they made some changes to the application in order to derive different hash in order to evade detection by signature based anti-viruses [59]. These apps eventually could have the same feature set for API calls and therefore need to be removed from dataset to avoid duplication [60]. One of the main contributions of this work is offering a clean ransomware dataset without duplicate apps.…”
Section: Removing Duplicates
mentioning
confidence: 99%
“…A possible solution is to remove duplicate records. However, the notion of duplication is context-dependent; removal of records that might be regarded as duplicates for one task may be harmful to other tasks (Chen et al, 2017b).…”
Section: Introduction
mentioning
confidence: 99%
“…Such dramatic increases in numbers of sequence records lead to duplication. Duplicate records can be broadly categorized as of two kinds: entity duplicates, which are records belonging to same entities (Chen et al, 2016b, 2017b; Yonchev et al, 2018), such as when the same gene records are submitted to the same database (Chen et al, 2017b); and near-duplicates or redundant records (Suzek et al, 2015; Mirdita et al, 2016; Chen et al, 2018), where records share some specified percentage X% similarity defined by users. For example, the Uniclust protein database defines redundant records at sequence similarity thresholds of 30%, 50%, and 90% for different purposes (Mirdita et al, 2016).…”
mentioning
confidence: 99%