Rising crime and terrorism have made security a growing concern, and surveillance cameras are now an indispensable tool for detecting abnormal behavior. However, most existing systems detect abnormalities in video with low accuracy, mainly because of noise. Videos captured by surveillance cameras generally contain some degree of noise for various reasons. To address this issue, we survey the different categories of noise and the handcrafted techniques used to remove them. Among these, non-local means and block-matching and 3D filtering (BM3D) filters perform remarkably well at image denoising. We also present a robust unsupervised deep learning model, the deep stacked denoising autoencoder (DSDAE), for image denoising and further use it for abnormal activity detection and localization in videos. Our approach achieves noteworthy image denoising results compared with the handcrafted techniques. DSDAE uses separate encoders to extract appearance features from clean and noisy images and motion features from optical flow images. The extracted features are combined by early fusion and passed to the decoder. Only pixels whose reconstruction error exceeds a threshold are considered abnormal. Experimental results are compared quantitatively and qualitatively with recent competitive state-of-the-art methods on the publicly available benchmark datasets Ped1, Ped2, CUHK Avenue, and ShanghaiTech, and they demonstrate the superior accuracy and performance of DSDAE. The area under the curve (AUC) obtained by DSDAE on Ped1, Ped2, CUHK Avenue, and ShanghaiTech is 98.14%, 97.92%, 95.89%, and 96.7%, respectively, and the equal error rate (EER) on the same datasets is 5.4%, 4.5%, 12.03%, and 7.8%, respectively.
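The pixel-level decision rule described above (a pixel is abnormal when its reconstruction error exceeds a threshold) can be illustrated with a minimal sketch. It assumes a squared-error measure averaged over color channels and a fixed scalar threshold; the function name `abnormal_pixel_mask`, the error measure, and the threshold value are illustrative assumptions, not details taken from the DSDAE model itself.

```python
import numpy as np


def abnormal_pixel_mask(frame, reconstruction, threshold):
    """Flag pixels whose reconstruction error exceeds `threshold`.

    Assumption: the error is the per-pixel squared difference, averaged
    over channels for color frames. The abstract does not state which
    error measure is used.
    """
    frame = np.asarray(frame, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    if frame.shape != reconstruction.shape:
        raise ValueError("frame and reconstruction must have the same shape")

    error = (frame - reconstruction) ** 2
    if error.ndim == 3:
        # Collapse the channel axis so the mask has one value per pixel.
        error = error.mean(axis=-1)
    return error > threshold


# Toy example: a 4x4 grayscale frame with a single poorly reconstructed pixel.
frame = np.zeros((4, 4))
frame[1, 2] = 1.0
reconstruction = np.zeros((4, 4))  # stand-in for the decoder output
mask = abnormal_pixel_mask(frame, reconstruction, threshold=0.5)
```

In this toy case only the pixel at row 1, column 2 is flagged. In practice the threshold would have to be tuned on validation data, since it trades false alarms against missed detections.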