DNA methylation is an important epigenetic mark, but how its locus specificity is determined in relation to DNA sequence is not fully understood. Here, we analyzed 34 diverse whole-genome bisulfite sequencing datasets in human and identified 313 motifs, including 92 associated with methylation (methylation motifs, MMs) and 221 associated with unmethylation (unmethylation motifs, UMs). The functionality of these motifs is supported by multiple lines of evidence. First, methylation levels at MMs and UMs are respectively higher and lower than the genomic background. Second, these motifs are enriched at the binding sites of methylation-modifying enzymes, including DNMT3A and TET1, suggesting a possible role in recruiting these enzymes. Third, these motifs significantly overlap with SNPs associated with gene expression and with SNPs associated with DNA methylation. Fourth, disruption of these motifs by SNPs is associated with significantly altered methylation levels of CpGs in the neighboring regions. Furthermore, these motifs together with somatic SNPs are predictive of cancer subtypes and patient survival. We also found that some of these motifs are associated with histone modifications, suggesting possible interplay between the two types of epigenetic modification, and that some motifs form feed-forward loops that contribute to DNA methylation dynamics.
Results
Defining DNA methylation regions and de novo motif discovery
We aimed to identify DNA motifs associated with DNA methylation and therefore started by searching for the methylation regions with the strongest signals. We collected whole-genome bisulfite sequencing (WGBS) data for 34 human methylomes generated by the NIH Roadmap Epigenomics Project 15,16 (Figure 1A). Following an approach similar to that of Ziller et al. 17, we defined 1.55 million methylation regions containing 11.5 million CpG sites across the 34 methylomes. Because methylome data are noisy, we considered only regions containing two or more CpGs within 400 bp of each other; together, these regions cover 29.2% of the human genome.
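To make the region filter concrete, the following minimal Python sketch groups CpG positions on one chromosome into candidate regions. It is an illustration only, not the pipeline used in this study: the function name `cluster_cpgs`, the parameters `max_gap` and `min_cpgs`, and the reading of "within 400 bp" as the maximum gap between adjacent CpGs are all assumptions made for the example.

```python
from typing import List, Tuple


def cluster_cpgs(
    positions: List[int],
    max_gap: int = 400,   # assumed reading: max distance between adjacent CpGs
    min_cpgs: int = 2,    # keep only regions with at least this many CpGs
) -> List[Tuple[int, int, int]]:
    """Group CpG positions into regions, returned as (start, end, n_cpgs).

    Hypothetical helper for illustration; positions are coordinates of
    CpG sites on a single chromosome.
    """
    regions: List[Tuple[int, int, int]] = []
    if not positions:
        return regions

    ordered = sorted(positions)
    start = prev = ordered[0]
    count = 1
    for pos in ordered[1:]:
        if pos - prev <= max_gap:
            # Close enough to the previous CpG: extend the current region.
            count += 1
        else:
            # Gap too large: close the current region and start a new one.
            if count >= min_cpgs:
                regions.append((start, prev, count))
            start = pos
            count = 1
        prev = pos

    # Close the final region.
    if count >= min_cpgs:
        regions.append((start, prev, count))
    return regions


if __name__ == "__main__":
    example = [100, 350, 700, 2000, 5000, 5100]
    print(cluster_cpgs(example))  # [(100, 700, 3), (5000, 5100, 2)]
```

In this toy input, the isolated CpG at position 2000 is discarded because it has no neighbor within 400 bp, while the other CpGs form two regions. A stricter reading, in which the whole region must span at most 400 bp, would need a different grouping rule.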