Motivation
Small insertions and deletions (sindels) in the human genome have important implications for human disease. One important mechanism by which noncoding sindels affect human diseases and phenotypes is the regulation of gene expression. Nevertheless, current sequencing experiments may lack the statistical power and resolution to pinpoint functional sindels because of low minor allele frequencies or small effect sizes. As an alternative strategy, a supervised machine learning method can identify otherwise masked functional sindels by predicting their regulatory potential directly. However, computational methods for annotating and predicting regulatory sindels, especially in noncoding regions, are underdeveloped.
Results
By leveraging labelled noncoding sindels identified by cis-expression quantitative trait loci (cis-eQTL) analyses across 44 tissues in GTEx, together with a compilation of generic functional annotations and large-scale epigenomic profiles, we develop TIVAN-indel, a supervised computational framework for predicting noncoding regulatory sindels. We demonstrate that TIVAN-indel achieves the best prediction performance in both within-tissue and cross-tissue prediction. As an independent evaluation, we train TIVAN-indel on the “Whole Blood” tissue in GTEx and test the model on 15 immune cell types from an independent study, DICE. Lastly, we perform an enrichment analysis of both true and predicted sindels in key regulatory regions such as chromatin interactions, open chromatin regions and histone modification sites, and find biologically meaningful enrichment patterns.
Availability
https://github.com/lichen-lab/TIVAN-indel
Supplementary information
Supplementary data are available at Bioinformatics online.