When a listener is exposed to speech from several speakers amid background noise, the human auditory system can focus on a single speaker of interest and disregard the others. Studies of this scenario have demonstrated that cortical activity tracks the speech envelope, and that it tracks the envelope of the attended speech more strongly than that of the unattended speech. By relating speech signals to neural activity, it has been shown that electroencephalography (EEG) signals can be used to determine which speaker a listener is attending to. Because the human brain is inherently non-linear, deep learning techniques may be used to analyze neural signals (EEG data) and decode the changing state of the brain. The non-linear methods proposed in recent studies have either used no speech streams or used only the envelope of the speech stream as an input feature. The performance of a neural network can be increased by providing it with information about the speech stream. To decode auditory attention, we introduce a hybrid convolutional neural network and long short-term memory model (CNN-LSTM). The CNN-LSTM model takes cortical recordings and the spectrograms of the multiple speech streams as input to decode attention in a two-speaker scenario. The decoding accuracy of the model is measured for short trial durations. The CNN-LSTM model proposed for audio source localization can be used to develop neuro-steered hearing devices and brain-computer interface (BCI) applications.
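To make the envelope-tracking idea concrete, the following is a minimal sketch of a correlation-based decision rule for a two-speaker scenario. It is not the CNN-LSTM model described above. It assumes that an envelope has already been reconstructed from the EEG by some decoder, and it selects the speaker whose speech envelope correlates more strongly with that reconstruction. The function names `pearson` and `decode_attention` and the toy values are hypothetical and serve illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def decode_attention(reconstructed, envelope_a, envelope_b):
    """Return 'A' or 'B': the speaker whose envelope best matches
    the envelope reconstructed from the EEG (assumed given)."""
    r_a = pearson(reconstructed, envelope_a)
    r_b = pearson(reconstructed, envelope_b)
    return "A" if r_a >= r_b else "B"


# Toy values (illustrative only, not measured data).
envelope_a = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
envelope_b = [0.8, 0.2, 0.7, 0.1, 0.9, 0.3]
reconstructed = [0.2, 0.8, 0.35, 0.7, 0.3, 0.65]

attended = decode_attention(reconstructed, envelope_a, envelope_b)
print(attended)
```

In this toy case the reconstructed envelope rises and falls with speaker A, so the rule selects A. A non-linear model such as the proposed CNN-LSTM replaces this fixed rule with a learned mapping from the EEG and the speech spectrograms to the attended speaker.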