Rapid advances in deep learning have produced several synthetic video and audio generation methods capable of creating highly realistic deepfakes. Such deepfakes can impersonate a source person in a video by swapping that person's face with a target face, or clone a target person's voice from audio samples. When used maliciously, they pose a threat to society, so distinguishing forged video frames, cloned voices, or both from genuine content has become an urgent problem. This work presents a novel smart deepfake video detection system. Video frames and audio are first extracted from the input videos. Two feature extraction methods are proposed, one per modality. The first is an upgraded XceptionNet model that extracts spatial features from the video frames and produces a visual feature representation. The second is a modified InceptionResNetV2 model based on the Constant-Q Transform (CQT), which extracts deep time-frequency features from the audio modality and produces an audio feature representation. The extracted features of the two modalities are then fused at a mid-level layer to produce a bimodal feature representation of the whole video. Each of these three representations (visual, audio, and fused) is independently fed into a Gated Recurrent Unit (GRU)-based attention mechanism that learns and extracts the salient temporal information at that level. The system then determines whether the forgery affects only the video frames, only the audio, or both, and produces the final decision on video authenticity. The proposed method has been evaluated on the FakeAVCeleb multimodal video dataset, and the experimental results confirm its superiority over current state-of-the-art methods.
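
The abstract does not give implementation details, but the overall pipeline it describes (per-modality CNN encoders, mid-level fusion, a GRU-based attention block per stream, and per-stream authenticity decisions) can be sketched roughly as follows. In this minimal PyTorch sketch, small placeholder CNNs stand in for the upgraded XceptionNet and the CQT-based InceptionResNetV2; all layer widths, the additive attention formulation, and the three two-way classification heads are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the bimodal pipeline described above (assumed details).
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Placeholder for the visual backbone: maps each RGB frame to a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):                      # x: (B*T, 3, H, W)
        return self.net(x)


class CQTEncoder(nn.Module):
    """Placeholder for the audio backbone: maps a CQT time-frequency patch to a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):                      # x: (B*T, 1, F, W)
        return self.net(x)


class GRUAttention(nn.Module):
    """GRU over per-segment features followed by additive attention pooling."""
    def __init__(self, in_dim, hid_dim=128):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, x):                      # x: (B, T, in_dim)
        h, _ = self.gru(x)                     # h: (B, T, hid_dim)
        w = torch.softmax(self.score(h), dim=1)
        return (w * h).sum(dim=1)              # attended temporal summary: (B, hid_dim)


class BimodalDeepfakeDetector(nn.Module):
    def __init__(self, feat_dim=256, hid_dim=128):
        super().__init__()
        self.visual = FrameEncoder(feat_dim)
        self.audio = CQTEncoder(feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)   # mid-level fusion layer
        self.temporal = nn.ModuleDict({
            "visual": GRUAttention(feat_dim, hid_dim),
            "audio": GRUAttention(feat_dim, hid_dim),
            "fused": GRUAttention(feat_dim, hid_dim),
        })
        # One real/fake head per stream so the system can report whether the
        # frames, the audio track, or both were manipulated.
        self.heads = nn.ModuleDict({k: nn.Linear(hid_dim, 2) for k in self.temporal})

    def forward(self, frames, cqt):
        # frames: (B, T, 3, H, W); cqt: (B, T, 1, F, W) aligned per segment
        B, T = frames.shape[:2]
        v = self.visual(frames.flatten(0, 1)).view(B, T, -1)
        a = self.audio(cqt.flatten(0, 1)).view(B, T, -1)
        f = torch.relu(self.fuse(torch.cat([v, a], dim=-1)))
        return {k: self.heads[k](self.temporal[k](stream))
                for k, stream in {"visual": v, "audio": a, "fused": f}.items()}


if __name__ == "__main__":
    model = BimodalDeepfakeDetector()
    logits = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 1, 84, 64))
    print({k: v.shape for k, v in logits.items()})   # three (2, 2) logit tensors
```

The three logit tensors correspond to the visual, audio, and fused streams, which is one plausible way to realize the abstract's final step of deciding whether the frames, the audio, or both have been forged.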