Transcription factors (TFs) play an important role in gene expression
and the regulation of 3D genome conformation. TFs can bind
to specific DNA fragments called enhancers and promoters. Some TFs
bind to promoter DNA fragments that lie near the transcription initiation
site and form complexes that allow polymerase enzymes to bind and initiate
transcription. Previous studies showed that DNA methylation can
prevent TFs from binding to DNA fragments. However,
recent studies have found that some TFs can bind to methylated
DNA fragments. The identification of these TFs is an important stepping stone
to a better understanding of cellular gene expression mechanisms.
Because experimental methods are often time-consuming and labor-intensive,
developing computational methods is essential. In this study, we propose
two machine learning methods for two problems: (1) identifying TFs
and (2) identifying TFs that prefer binding to methylated DNA targets
(TFPMs). For the TF identification problem, the proposed method uses
the position-specific scoring matrix for data representation and a
deep convolutional neural network for modeling. This method achieved
90.56% sensitivity, 83.96% specificity, and an area under the receiver
operating characteristic curve (AUC) of 0.9596 on an independent test
set. For the TFPM identification problem, we propose to use the reduced
g-gap dipeptide composition for data representation and
the support vector machine algorithm for modeling. This method achieved
82.61% sensitivity, 64.86% specificity, and an AUC of 0.8486 on another
independent test set. Both results exceed those reported by other
studies on the same problems.
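
To make the reduced g-gap dipeptide composition more concrete, the following is a minimal Python sketch of how such a feature vector might be computed from a protein sequence. The amino acid grouping (GROUPS) and the function name are illustrative assumptions, not the reduction scheme or code used in this study: each residue is mapped to a group label, and the frequency of every ordered pair of group labels separated by g residues is counted.

```python
from itertools import product

# Hypothetical reduced alphabet (an assumption for illustration only):
# residues are clustered by rough physicochemical similarity.
GROUPS = {
    "a": "AVLIMC",  # aliphatic / hydrophobic
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # special conformations
}
RESIDUE_TO_GROUP = {aa: label for label, members in GROUPS.items() for aa in members}


def reduced_g_gap_dipeptide_composition(sequence, gap=1):
    """Return the normalized frequencies of group pairs separated by `gap` residues."""
    # Map residues to group labels; non-standard residues are skipped.
    reduced = [RESIDUE_TO_GROUP[aa] for aa in sequence.upper() if aa in RESIDUE_TO_GROUP]

    # One feature per ordered pair of group labels, in a fixed (sorted) order.
    labels = sorted(GROUPS)
    counts = {a + b: 0 for a, b in product(labels, repeat=2)}

    # A g-gap dipeptide pairs position i with position i + gap + 1.
    n_pairs = len(reduced) - gap - 1
    if n_pairs <= 0:
        return [0.0] * len(counts)

    for i in range(n_pairs):
        counts[reduced[i] + reduced[i + gap + 1]] += 1

    return [counts[key] / n_pairs for key in sorted(counts)]


features = reduced_g_gap_dipeptide_composition("MKRDEAVLGP", gap=1)
```

With six groups the vector has 36 entries regardless of sequence length, compared with 400 for the unreduced 20-letter alphabet, which is the usual motivation for reducing the alphabet before training a classifier such as a support vector machine.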