Despite the widespread adoption of the ChIP-seq technique, there is still no consensus on quality assessment procedures. Quantitative metrics previously proposed in the literature are not always effective in discriminating the success or failure of an experiment, thus hampering the objectivity and reproducibility of quality control. Here we introduce ChIC, a new framework for ChIP-seq data quality assessment that overcomes the limitations of previous solutions. ChIC is the first method for ChIP-seq quality control that directly considers the shape of the enrichment profile, thus achieving good performance on ChIP targets yielding sharp and broad peaks alike. We integrate a comprehensive set of quality control metrics into a single score that reliably summarizes sample quality. The ChIC score is based on a machine learning classifier trained on a compendium of thousands of ChIP-seq profiles, which can also be used as a reference for easier evaluation of new datasets. ChIC is implemented as a user-friendly R/Bioconductor package.
RESULTS
Standard QC-metrics are biased by the shape of the ChIP-seq enrichment profile
We analysed a large set of ChIP-seq experiments with paired input controls, for a total of 3936 samples from large public databases (Table 1), including the ENCODE project [23] and the Roadmap Epigenomics Consortium (Roadmap) [24], to build a reference compendium of QC-metrics comprising both previously proposed and novel metrics (Fig. 1a).
Previously proposed quantitative QC-metrics for ChIP-seq do not explicitly take into account the shape of the enrichment profile, yet the shape affects the resulting QC scores. For example, some of the QC-metrics proposed by the ENCODE consortium are more effective on ChIP targets yielding narrow peak profiles, such as TFs, because they are designed for ChIP-seq enrichment profiles in which sequencing reads are localized in small regions. This is the case for all the metrics based on strand-shift analyses [3,4,7,10,11]. Strand-shift analyses, such as the "cross-correlation" analysis (Supplementary Figure S1a; methods), aim to detect the clustering of reads in a ChIP-seq sample without relying on peak calling. The strand-shift profile is calculated from the positions of reads on the positive and negative strands, which are progressively shifted towards each other; at each shift, the correlation [3,4] between the read positions on the two strands is computed. In other variants of this analysis, measures such as the Jaccard Index [7] or the Hamming distance [15] have been proposed instead to quantify the clustering of reads on the positive and negative strands. Multiple QC-metrics for the strength of the ChIP enrichment can then be derived from the resulting cross-correlation profile, for example the normalized or relative strand coefficient (NSC and RSC, respectively) or the quality control tag (QC tag) [3,4,11]. We refer to this set of metrics, along with others described by Landt et al. [3], as ENCODE Metrics (EM) (see methods).
Notably, the relative strand coefficient (RSC) is proportional to the level of clusterin...
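The strand-shift (cross-correlation) analysis described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ChIC implementation: the function name `strand_cross_correlation`, the toy genome length, the binding-site positions and the fragment length of 100 bp are all hypothetical, and reads are reduced to the positions of their 5' ends. The sketch only builds the cross-correlation profile and locates its maximum; metrics such as NSC and RSC would be derived from this profile and are not computed here.

```python
import numpy as np

def strand_cross_correlation(plus_starts, minus_starts, genome_len, max_shift):
    """Pearson correlation between per-base 5'-end counts on the two strands,
    evaluated for every shift from 0 to max_shift (hypothetical helper)."""
    plus = np.bincount(np.asarray(plus_starts, dtype=np.int64),
                       minlength=genome_len).astype(float)
    minus = np.bincount(np.asarray(minus_starts, dtype=np.int64),
                        minlength=genome_len).astype(float)
    cc = np.empty(max_shift + 1)
    for shift in range(max_shift + 1):
        # Pair plus-strand position i with minus-strand position i + shift,
        # i.e. move the minus-strand signal towards the plus-strand signal.
        a = plus[: genome_len - shift]
        b = minus[shift:]
        cc[shift] = np.corrcoef(a, b)[0, 1]
    return cc

# Toy data (assumed values): three binding sites, fragments of 100 bp.
genome_len = 10_000
sites = [1000, 3000, 6000]
offsets = range(-5, 6)  # small spread of read starts around each site

# Plus-strand 5' ends pile up upstream of each site, minus-strand 5' ends downstream.
plus_starts = [s - 50 + j for s in sites for j in offsets]
minus_starts = [s + 50 + j for s in sites for j in offsets]

cc = strand_cross_correlation(plus_starts, minus_starts, genome_len, max_shift=300)
best_shift = int(np.argmax(cc))  # shift at which the two strands agree best
print(f"shift with maximal cross-correlation: {best_shift}")
```

In this noise-free toy example the profile peaks at the simulated fragment length, because that shift brings the read clusters on the two strands into register. In a broad enrichment profile, where reads are not localized in small regions, such a peak is expected to be much less pronounced, which is the bias discussed in this section.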