Various algorithms exist for audio deepfake synthesis, such as Deep Voice, Tacotron, FastSpeech, and voice imitation techniques. Despite the availability of various spoofed-speech detectors, they cannot yet distinguish unseen audio samples with high precision. In this study, we propose a robust model, the Ensemble Deep Learning based Detector (EDL-Det), to detect text-to-speech (TTS) synthesized speech and classify audio into spoofed and bonafide classes. Our proposed model improves on YAMNet by employing VGG19 as the base network instead of MobileNet and combining it with two other deep learning (DL) networks. The system analyzes mel-spectrograms generated from the input audio to extract the artifacts underlying the audio signals. We add an ensemble learning block consisting of ResNet50 and InceptionNetv2. First, we convert the speech into mel-spectrograms, which are time-frequency representations. Second, we train our model on the ASVspoof 2019 dataset. Finally, we classify input audio by converting it into mel-spectrograms and feeding them to the trained binary classifiers, combining the three networks' outputs through a majority-voting scheme. Owing to its deep convolutional architecture, our proposed model effectively extracts the most representative features from the mel-spectrograms. Furthermore, we perform extensive experiments on the ASVspoof 2019 corpus to assess the performance of the proposed model. Additionally, our proposed model is robust enough to identify unseen spoofed audio and to accurately classify attacks according to their cloning algorithms.
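To make the described pipeline concrete, the minimal sketch below illustrates its two core steps: converting an audio file into a log-scaled mel-spectrogram and fusing the three networks' binary predictions by majority vote. This is an illustrative sketch, not the authors' implementation; the `librosa` parameters (sampling rate, number of mel bands) and the classifier interface in the usage comment are assumptions.

```python
import numpy as np
import librosa


def audio_to_mel_spectrogram(path, sr=16000, n_mels=128):
    """Load an audio file and convert it to a log-scaled mel-spectrogram,
    i.e. the time-frequency representation fed to the CNNs.
    sr and n_mels are assumed values, not taken from the paper."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def majority_vote(predictions):
    """Combine binary labels (0 = bonafide, 1 = spoofed) from the three
    networks; the class receiving a strict majority of votes wins."""
    votes = np.asarray(predictions)
    return int(votes.sum() > len(votes) / 2)


# Hypothetical usage, assuming `models` holds the three trained binary
# classifiers (VGG19-based YAMNet variant, ResNet50, InceptionNetv2),
# each returning 0 or 1 for a given mel-spectrogram:
#   mel = audio_to_mel_spectrogram("sample.flac")
#   preds = [m.predict(mel) for m in models]
#   label = majority_vote(preds)  # 0 = bonafide, 1 = spoofed
```

With three voters, the scheme simply returns whichever class at least two of the networks agree on, which is why an odd number of ensemble members avoids ties.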