Polygraphs are used in criminal interrogations to detect deception. However, polygraphs are difficult to administer when circumstances prevent the use of biosensors. To address this shortcoming, deception-detection technology that does not rely on biosensors is needed. We propose FacialCueNet, a deep-learning-based multi-modal deception-detection network that uses both facial images and facial cues. FacialCueNet incorporates facial cues that indicate deception, such as action-unit frequency, symmetry, gaze pattern, and micro-expressions, all extracted from videos. Additionally, a spatial-temporal attention module, based on a convolutional neural network and convolutional long short-term memory, is applied to FacialCueNet to provide interpretable information from interrogations. Because our goal was to develop an algorithm applicable to criminal interrogations, we trained and evaluated FacialCueNet on the DDCIT dataset, which was collected using a data acquisition protocol similar to those used in actual investigations. A public dataset was also used to compare deception-detection performance with state-of-the-art works. On the DDCIT dataset, the mean deception-detection F1 score was 81.22%, with an accuracy of 70.79%, recall of 0.9476, and precision of 0.7107. On the public dataset, our method achieved an accuracy of 88.45% and an AUC of 0.9541, an improvement of 1.25% compared to the previous results. We also present interpretive results of deception detection by analyzing the influence of spatial and temporal factors. These results show that FacialCueNet has the potential to detect deception using only facial videos. By providing interpretations of its predictions, our system could be a useful tool for criminal interrogation.
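
As a quick consistency check on the DDCIT figures, the reported F1 score can be recomputed from the reported precision and recall using the standard definition of F1 as their harmonic mean. The sketch below is illustrative only: the helper name `f1_score` is hypothetical and is not taken from the authors' code, and it assumes the reported mean precision and recall can be combined directly, which need not hold exactly for metrics averaged over folds.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the DDCIT dataset.
precision = 0.7107
recall = 0.9476

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.8122"
```

The recomputed value agrees with the reported mean F1 score of 81.22%. The high recall combined with lower precision suggests that, on DDCIT, the model flags most deceptive samples at the cost of some false positives.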