2014
DOI: 10.3389/fgene.2014.00381

Mappability and read length

Abstract: Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as 10^4 bases, or 10^5 to 10^6 bases when allowing for mismatches between repeat units. Sequence reads from these regions are therefore unmappable when the read length…
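The link between read length and unique alignment can be illustrated with a toy calculation. The sketch below is not the authors' method: the function name `unique_fraction` and the 29-base sequence are invented for illustration, and only exact matches are considered. It counts the fraction of read start positions whose read-length substring occurs exactly once in the sequence.

```python
from collections import Counter


def unique_fraction(genome: str, read_len: int) -> float:
    """Fraction of start positions whose read_len-mer occurs exactly once."""
    n = len(genome) - read_len + 1
    if n <= 0:
        raise ValueError("read_len must not exceed the sequence length")
    kmers = [genome[i:i + read_len] for i in range(n)]
    counts = Counter(kmers)
    return sum(1 for k in kmers if counts[k] == 1) / n


# Hypothetical toy sequence: two exact copies of a 6-base repeat
# embedded in otherwise unique flanking sequence.
REPEAT = "ACGTTG"
toy = "TTAGC" + REPEAT + "GGATCCA" + REPEAT + "CATGA"

for read_len in (4, 6, 8):
    print(read_len, round(unique_fraction(toy, read_len), 3))
```

In this toy case, reads longer than the 6-base repeat always extend into unique flanking sequence, so every start position becomes uniquely placeable. Real genomes also contain inexact repeats, which this exact-match sketch ignores.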

Cited by 47 publications (40 citation statements)
References 66 publications
“…We first filtered our variant call set for rare heterozygous coding variants (MAF <= 1x10^-4 across all populations represented in gnomAD and ExAC databases). To account for regions in the reference genome that are more challenging to resolve, we removed variant sites found in regions of non-unique mappability (score<1; 300bp), likely segmental duplication (score>0.95), and known low-complexity [82]. We then excluded sites located in MUC and HLA genes and imposed a maximum variant read depth threshold of 500.…”
Section: Pre-processing and QC
Citation type: mentioning
confidence: 99%
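The filtering steps quoted above can be sketched as a single predicate. This is a minimal illustration, not the citing authors' pipeline: the `Variant` record, its field names, and the prefix test for MUC and HLA genes are all assumptions, and only the thresholds come from the quoted statement.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # All field names are hypothetical; a real pipeline would read them
    # from annotated variant calls.
    gene: str
    max_pop_maf: float     # highest allele frequency across populations
    mappability: float     # 300 bp mappability score, 1.0 means unique
    segdup_score: float    # segmental duplication score
    low_complexity: bool   # overlaps a known low-complexity region
    read_depth: int
    heterozygous: bool = True
    coding: bool = True


def passes_filters(v: Variant) -> bool:
    """Apply the quoted thresholds in order; True means the site is kept."""
    if not (v.heterozygous and v.coding):
        return False
    if v.max_pop_maf > 1e-4:        # keep rare variants only
        return False
    if v.mappability < 1.0:         # non-unique mappability
        return False
    if v.segdup_score > 0.95:       # likely segmental duplication
        return False
    if v.low_complexity:
        return False
    # Assumption: gene families are recognised by symbol prefix.
    if v.gene.startswith(("MUC", "HLA")):
        return False
    if v.read_depth > 500:          # maximum read depth threshold
        return False
    return True


kept = Variant("BRCA2", 5e-5, 1.0, 0.0, False, 120)
dropped = Variant("MUC16", 5e-5, 1.0, 0.0, False, 120)
print(passes_filters(kept), passes_filters(dropped))
```

Writing each criterion as its own early return keeps the thresholds easy to audit against the text.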
“…This can be due to a variety of reasons, including "pseudogenes" (areas of the DNA that look like a particular gene, but are slightly different), repetitive sequences, and regions of duplications or deletions, all of which naturally occur in all individuals. At read lengths typically employed in NGS, 5-10% of the genome cannot be confidently mapped due to these factors [reviewed in Li and Freudenberg (2014)]. …”
Section: Sequencing Technologies
Citation type: mentioning
confidence: 99%
“…Repetitive elements in the genome, because of the presence of similar or identical sequences, sometimes cause mapping error of the short reads [6][7][8]. The chloroplast genome of higher plants contains the inverted repeat region (IR) [9].…”
Section: Introduction
Citation type: mentioning
confidence: 99%