Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. On benchmarks, mHi-C exhibited superior performance over utilizing only uni-reads and over heuristic approaches aimed at rescuing multi-reads. Specifically, mHi-C increased the sequencing depth by an average of 20%, leading to higher reproducibility of contact matrices and a larger number of significant interactions across biological replicates. The impact of the multi-reads on the identification of novel significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, the cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C-rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.

genomic origin. Contacts captured by the Hi-C assay can arise as random contacts of nearby genomic positions or true biological interactions. The mHi-C generative model acknowledges this feature by utilizing data-driven priors, $\pi_{(j,k)}$ for bin pairs $j$ and $k$, as a function of contact distance between the two bins. mHi-C updates these prior probabilities for each candidate bin pair that a multi-read can be allocated to by leveraging local contact counts. As a result, for each multi-read $i$, it estimates posterior probabilities of the genomic origin variable $Z_i$.
Specifically, $\Pr(Z_{i,(j,k)} = 1 \mid Y_i, \pi)$ denotes the posterior probability, i.e., allocation probability, that the two read ends of multi-read $i$ originate from the bin pair $(j, k)$. These posterior probabilities, which can also be viewed as fractional contacts of multi-read $i$, are then utilized to assign each multi-read to its most likely genomic origin. Our results in this paper only utilized reads with allocation probability greater than 0.5. This ensured that the output of mHi-C was compatible with the standard input of the downstream normalization and statistical significance estimation methods (Imakaev et al., 2012; Knight and Ruiz, 2013; Ay et al., 2014a).
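The allocation rule described above can be illustrated with a minimal Python sketch. It is not the mHi-C implementation: it assumes, for illustration only, that the allocation probability of a multi-read is the prior of each candidate bin pair renormalized over that read's candidates, and it omits the iterative prior update. The names `allocation_probabilities` and `assign_read`, and the example prior values, are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

BinPair = Tuple[int, int]


def allocation_probabilities(
    candidates: List[BinPair], prior: Dict[BinPair, float]
) -> Dict[BinPair, float]:
    """Renormalize prior weights over the candidate bin pairs of one multi-read.

    Assumption: the posterior is proportional to the prior of each candidate.
    """
    weights = {pair: prior.get(pair, 0.0) for pair in candidates}
    total = sum(weights.values())
    if total == 0.0:
        # No prior information: fall back to a uniform allocation.
        return {pair: 1.0 / len(weights) for pair in weights}
    return {pair: w / total for pair, w in weights.items()}


def assign_read(
    posteriors: Dict[BinPair, float], threshold: float = 0.5
) -> Optional[BinPair]:
    """Return the most likely bin pair if its probability exceeds the threshold."""
    if not posteriors:
        return None
    best_pair = max(posteriors, key=posteriors.get)
    return best_pair if posteriors[best_pair] > threshold else None


# Hypothetical priors for three candidate bin pairs of one multi-read.
prior = {(3, 7): 0.375, (3, 42): 0.125, (15, 42): 0.125}
posteriors = allocation_probabilities([(3, 7), (3, 42), (15, 42)], prior)
assigned = assign_read(posteriors)

# A read whose two candidates are equally likely is left unassigned.
ambiguous = assign_read(allocation_probabilities([(1, 2), (1, 9)], {(1, 2): 0.25, (1, 9): 0.25}))
```

Because the threshold is strictly greater than 0.5, at most one candidate bin pair per multi-read can pass it, so each retained read contributes a single whole contact to the downstream contact matrix.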
Probabilistic assignment of multi-reads leads to more complete contact matrices and improves reproducibility across replicates

Before quantifying mHi-C model performance, we first provide a direct visual comparison of the contact matrices between Uni-setting and Uni&Multi-setting using raw contact counts and normalized contact counts. We utilize Knight-Ruiz Matrix Balancing normalization (Knight and Ru...