Identifying functional enhancers elements in metazoan systems is a major challenge.For example, large-scale validation of enhancers predicted by ENCODE reveal false positive rates of at least 70%. Here we use the pregrastrula patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held out data results from heterogeneity of functional signatures in enhancer elements. We show that two classes of enhancer are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, over 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well predicted elements is composed predominantly of enhancers driving multi-stage, segmentation patterns, which we designate segmentation driving enhancers (SDE). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome, 916 of which are novel. An analysis of 32 novel SDEs using wholemount embryonic imaging of stably integrated reporter constructs chosen throughout our prediction rank-list showed >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.. CC-BY-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.The copyright holder for this preprint . http://dx.doi.org/10.1101/250241 doi: bioRxiv preprint first posted online Jan. 18, 2018; 3
Significance StatementWe demonstrate a high accuracy method for predicting enhancers genome wide with > 85% precision as validated by transgenic reporter assays in Drosophila embryos. This is the first time such accuracy has been achieved in a metazoan system, allowing us to predict with highconfidence 1640 enhancers, 916 of which are novel. The predicted enhancers are demarcated by heterogeneous collections of epigenetic marks; many strong enhancers are free from classical indicators of activity, including H3K27ac, but are bound by key transcription factors.H3K27ac, often used as a one-dimensional predictor of enhancer activity, is an uninformative parameter in our data.