The first and critical step in accurate cardiac function analysis (i.e., measurement of ventricular volumes, ejection fraction, and stroke volume) in echocardiography is the detection of end-diastole (ED) and end-systole (ES) frames. Detecting these frames is challenging due to variations in cardiac structure, heart rate changes associated with clinical conditions, and the low-resolution nature of echo sequences. Several deep learning techniques have recently emerged, primarily combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These models were trained on a range of datasets, including open-access sources such as CAMUS, PACS, and EchoNet-Dynamic, as well as private datasets. Echo phase detection performance on EchoNet-Dynamic, the largest open-access dataset, is poorer than on other datasets, suggesting that the dataset contains noise. This noise is removed in this study using three preprocessing steps: noise reduction based on a heart rate formulation, video frame synchronization, and a non-oscillating mean absolute frame difference (MAFD) criterion. Additionally, this study formulates the echo phase detection problem as frame-level binary classification, distinguishing between the diastole and systole phases. The proposed architecture takes an echo sequence as input, and a customized time-distributed CNN extracts spatial features from each frame. These frame-level features are fed into a bidirectional long short-term memory (BLSTM) network to capture temporal information, followed by a classification layer. The model is trained on the preprocessed EchoNet-Dynamic dataset. Compared to the true labels, an average absolute frame distance (aaFD) of 1.02 and 1.04 frames is achieved for ED and ES frames, respectively. Moreover, the model operates with an inference time of less than 65 ms for an input sequence of 32 frames on a graphics processing unit (GPU).
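To make the described architecture concrete, the sketch below shows a minimal time-distributed CNN feeding a BLSTM with a frame-wise sigmoid classifier, assuming a Keras implementation. The function name build_phase_detector, the layer widths, and the 112x112 grayscale input are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_phase_detector(seq_len=32, frame_size=(112, 112), channels=1):
    """Sketch: time-distributed CNN + BLSTM for frame-level phase labels."""
    inputs = layers.Input(shape=(seq_len, *frame_size, channels))

    # Per-frame spatial feature extractor (hypothetical small CNN;
    # the paper uses a customized time-distributed CNN).
    cnn = models.Sequential([
        layers.Input(shape=(*frame_size, channels)),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.GlobalAveragePooling2D(),
    ])
    x = layers.TimeDistributed(cnn)(inputs)  # (batch, seq_len, features)

    # BLSTM captures temporal context in both directions across the sequence.
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

    # Frame-level binary classification: e.g., 0 = diastole, 1 = systole.
    outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

Under this frame-level binary formulation, ED and ES frames would then be read off as the transitions in the predicted diastole/systole label sequence.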