Perm-seq: Mapping Protein-DNA Interactions in Segmental Duplication and Highly Repetitive Regions of Genomes with Prior-Enhanced Read Mapping

Zeng, Xin; Li, Bo; Welch, Rene; Rojo, Constanza; Zheng, Yi; Dewey, Colin N.; Keleş, Sündüz

doi:10.1371/journal.pcbi.1004491

Cited by 13 publications

(15 citation statements)

References 40 publications

“…In addition to discovery of novel TADs (Figure 5—figure supplement 3) by filling in the gaps in the contact matrix and boosting the domain signals, mHi-C also refines TAD boundaries (Figure 5—figure supplements 4 and 5), and eliminates potential false positive TADs that are split by the contact depleted gaps in Uni-setting (Figure 5—figure supplements 6–8). The novel, adjusted, and eliminated TADs are largely supported by CTCF signal identified using both uni- and multi-reads ChIP-seq datasets (Zeng et al, 2015) as well as convergent CTCF motifs (Figure 5—figure supplement 2D), providing support for mHi-C driven modifications to these TADs and revealing a slightly lower false discovery rate for mHi-C compared to Uni-setting (Figure 5C, Figure 5—figure supplement 2E, and Figure 5—figure supplement 9).…”

Section: Results (mentioning)

confidence: 86%

“…Such reads from repetitive regions can be aligned to multiple positions (Figure 1A) and are referred to as multi-mapping reads or multi-reads for short. The critical drawbacks of discarding multi-reads have been recognized in other classes of genomic studies such as transcriptome sequencing (RNA-seq) (Li and Dewey, 2011), chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) (Chung et al, 2011; Zeng et al, 2015), as well as genome-wide mapping of protein-RNA binding sites (CLIP-seq or RIP-seq) (Zhang and Xing, 2017). More recently, (Sun et al, 2018) and (Cournac et al, 2016) argued for a fundamental role of repeat elements in the 3D folding of genomes, highlighting the role of higher order chromatin architecture in repeat expansion disorders.…”

Section: Introduction (mentioning)

confidence: 99%

Generative modeling of multi-mapping reads with mHi-C advances analysis of Hi-C studies

Zheng

Keleş

2019

eLife

Self Cite

Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed and validated mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. mHi-C exhibited superior performance over utilizing only uni-reads and heuristic approaches aimed at rescuing multi-reads on benchmarks. Specifically, mHi-C increased the sequencing depth by an average of 20% resulting in higher reproducibility of contact matrices and detected interactions across biological replicates. The impact of the multi-reads on the detection of significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.

“…29-31). The novel, adjusted, and eliminated TADs are largely supported by CTCF ChIP-seq signal [21] as well as convergent CTCF motifs (Supplementary Fig. 25d), providing evidence for mHi-C driven modifications to these TADs and revealing a lower false discovery rate for mHi-C compared to Uni-setting (Fig.…”

mentioning

confidence: 89%

“…1a-left) and are referred to as multi-mapping reads or multi-reads for short. The critical drawbacks of discarding multi-reads have been recognized in other classes of genomic studies such as transcriptome sequencing (RNA-seq) [19], chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) [20,21], as well as genome-wide mapping of protein-RNA binding sites (CLIP-seq or RIP-seq) [22]. In this work, we developed mHi-C, a hierarchical model that probabilistically allocates Hi-C multi-reads to their most likely genomic origins by utilizing specific characteristics of the paired-end reads of the Hi-C assay.…”

mentioning

confidence: 99%

Generative Modeling of Multi-mapping Reads with mHi-C Advances Analysis of High Throughput Genome-wide Conformation Capture Studies

Zheng

Keleş

2018

Preprint

Self Cite

Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. mHi-C exhibited superior performance over utilizing only uni-reads and heuristic approaches aimed at rescuing multi-reads on benchmarks. Specifically, mHi-C increased the sequencing depth by an average of 20% leading to higher reproducibility of contact matrices and larger number of significant interactions across biological replicates. The impact of the multi-reads on the identification of novel significant interactions is influenced marginally by relative contribution of multi-reads to the sequencing depth compared to uni-reads, cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.

[…] genomic origin. Contacts captured by Hi-C assay can arise as random contacts of nearby genomic positions or true biological interactions. mHi-C generative model acknowledges this feature by utilizing data-driven priors for bin pairs, as a function of contact distance between the two bins. mHi-C updates these prior probabilities for each candidate bin pair that a multi-read can be allocated to by leveraging local contact counts. As a result, for each multi-read, it estimates posterior probabilities of the genomic origin variable. Specifically, the posterior probability, i.e., allocation probability, is the probability that the two read ends of a multi-read originate from a given pair of bins. These posterior probabilities, which can also be viewed as fractional contacts of the multi-read, are then utilized to assign each multi-read to its most likely genomic origin. Our results in this paper only utilized reads with allocation probability greater than 0.5. This ensured the output of mHi-C to be compatible with the standard input of the downstream normalization and statistical significance estimation methods (Imakaev et al., 2012; Knight and Ruiz, 2013; Ay et al., 2014a).

Probabilistic assignment of multi-reads leads to more complete contact matrices and improves reproducibility across replicates

Before quantifying mHi-C model performance, we first provide direct visual comparison of the contact matrices between Uni-setting and Uni&Multi-setting using raw contact counts and normalized contact counts. We utilize Knight-Ruiz Matrix Balancing normalization (Knight and Ru...
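The allocation step described in the excerpt above (candidate bin pairs, posterior allocation probabilities, and a cutoff of 0.5) can be illustrated with a minimal sketch. This is not the mHi-C implementation: the function name `allocate_multi_read` and the input format are hypothetical, and each weight is assumed to stand in for a distance-based prior already updated with local contact counts.

```python
from typing import Dict, Optional, Tuple

BinPair = Tuple[int, int]


def allocate_multi_read(
    candidate_weights: Dict[BinPair, float],
    threshold: float = 0.5,
) -> Tuple[Optional[BinPair], Dict[BinPair, float]]:
    """Toy allocation of one multi-read among its candidate bin pairs.

    candidate_weights maps each candidate (bin_j, bin_k) to a nonnegative
    weight (assumption: a prior already combined with local contact counts).
    Returns the chosen bin pair, or None if no candidate exceeds the
    threshold, together with the normalized allocation probabilities.
    """
    total = sum(candidate_weights.values())
    if total <= 0:
        return None, {}

    # Normalize the weights so they sum to 1 across candidate bin pairs.
    posteriors = {pair: w / total for pair, w in candidate_weights.items()}

    # Keep the read only if its best candidate is strictly above the cutoff.
    best_pair, best_prob = max(posteriors.items(), key=lambda kv: kv[1])
    chosen = best_pair if best_prob > threshold else None
    return chosen, posteriors


# One candidate dominates, so the read is allocated to it.
chosen, probs = allocate_multi_read({(1, 2): 6.0, (1, 40): 1.0, (7, 9): 1.0})

# Two equally likely candidates: neither exceeds 0.5, so the read is dropped.
ambiguous, _ = allocate_multi_read({(1, 2): 1.0, (3, 4): 1.0})
```

Because the cutoff is strictly greater than 0.5, at most one candidate can pass for any read, which is what lets each retained multi-read contribute a single contact to downstream tools that expect integer counts.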

“…One method focused on detecting structural variation using discordant read pairs includes VariationHunter, which constructs consistent clusters of reads, including probabilistic assignment of reads that have multiple mappings [24]. Similarly, probabilistic approaches have been used to call ChIP-seq peaks from multiply-mapped reads [25,26]. These approaches are geared toward discovering genetic variation or functional genomics signals in repetitive sequences, and generally work by modeling the distribution of signals among multiple read placements.…”

Section: Introduction (mentioning)

confidence: 99%

Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2

Shen

Kidd

2020

Genes

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two, 92 genes where the majority of samples have a copy number other than two, and describe rare copy number variation affecting multiple genes at the APOBEC3 locus.
