2019
DOI: 10.1093/nar/gkz841

Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

Abstract: The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let al…

Cited by 262 publications (226 citation statements)
References 138 publications (152 reference statements)
“…In brief, high quality Illumina reads were prepared using TrimGalore (https://github.com/FelixKrueger/TrimGalore) based on the following criteria: (i) no “N” base, (ii) trimming of adaptor sequences and low quality bases (Q<20), (iii) no trimmed reads < 100 bp. To avoid mis-assembly due to repetitive sequences (Tørresen et al, 2019), PacBio SEQUEL subreads with repetitive sequences comprised over 85% of total sequences were filtered out. The GC content criteria (<25% and >85%) was applied for filtering low complexity DNA sequences before assembly.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Despite much interest [8, 14-16], the most recent and commonly cited census of protein TRs summarizing repeats in the curated protein knowledge base UniProtKB/Swiss-Prot [17] dates back two decades [18]. Since then this popular data bank has grown more than seven-fold (Figure S1).…”
Section: Comprehensive Annotation Of Proteomic Tandem Repeats
Citation type: mentioning
confidence: 99%
“…This allows our study to provide an unprecedented detail of the universe of protein TRs. We respond to the call [14] and apply the state-of-the-art method for TR detection followed by filtering through a sound statistical framework.…”
Section: Comprehensive Annotation Of Proteomic Tandem Repeats
Citation type: mentioning
confidence: 99%
“…Widely used sequencing technologies, such as Sanger, 454 and Illumina, have played a pivotal part in these advancements. However, the limitations of these technologies, namely their trouble reading through repetitive regions and their short read outputs, have led to assembly artifacts that are currently widely distributed in genome and proteome databases [43]. A number of protozoan parasite genomes have been recently revisited using third generation sequencing technologies.…”
Section: Discussion
Citation type: mentioning
confidence: 99%