Accurate detection of end-systolic (ES) and end-diastolic (ED) frames in an echocardiographic cine series can be difficult but necessary pre-processing step for the development of automatic systems to measure cardiac parameters. The detection task is challenging due to variations in cardiac anatomy and heart rate often associated with pathological conditions. We formulate this problem as a regression problem and propose several deep learningbased architectures that minimize a novel global extrema structured loss function to localize the ED and ES frames. The proposed architectures integrate convolution neural networks (CNNs)-based image feature extraction model and recurrent neural networks (RNNs) to model temporal dependencies between each frame in a sequence. We explore two CNN architectures: DenseNet and ResNet, and four RNN architectures: long short-term memory, bi-directional LSTM, gated recurrent unit (GRU), and Bi-GRU, and compare the performance of these models. The optimal deep learning model consists of a DenseNet and GRU trained with the proposed loss function. On average, we achieved 0.20 and 1.43 frame mismatch for the ED and ES frames, respectively, which are within reported inter-observer variability for the manual detection of these frames.