Low-dose computed tomography (LDCT) is crucial for limiting patients' exposure to radiation. However, the high noise level in LDCT images can degrade image quality and lead to less accurate diagnoses. Deep learning, especially supervised methods, has recently been widely accepted as a powerful tool for LDCT image denoising. However, supervised methods require large paired datasets of LDCT and high-quality pristine CT images, which are rarely available in real-world clinical scenarios. This study presents an unsupervised learning-based framework called MM-Net, which is trained in two steps for volumetric LDCT denoising. In the first step, we train an initial denoising network, a multi-scale attention U-Net (MSAU-Net), in a self-supervised manner to predict the noise-suppressed center slice from a neighboring multi-slice input. In the second step, we train a U-Net-based final denoiser on top of the pre-trained MSAU-Net to further improve image quality by introducing a new multi-patch and multi-mask matching loss. Qualitative visual inspection and quantitative measures across texturally different domains of clinical and animal data show that the proposed MM-Net outperforms all competing state-of-the-art unsupervised algorithms. The unsupervised method also achieves denoising performance comparable to that of representative supervised methods trained with ground truth images.
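
To make the first training step concrete, the following is a minimal sketch of how self-supervised training pairs might be assembled from a single noisy LDCT volume. It is not the authors' implementation. It assumes that the network input consists of the slices on either side of the center slice with the center slice itself excluded, and that the noisy center slice serves as the regression target; the abstract does not state either detail. The function name `make_center_slice_pairs` and the parameter `half_width` are hypothetical.

```python
import numpy as np


def make_center_slice_pairs(volume, half_width=1):
    """Build (neighbor-stack, center-slice) pairs from one noisy volume.

    Assumption (not confirmed by the source): the center slice is left out
    of the input and used as the target, so no clean image is required.

    volume:     array of shape (depth, height, width)
    half_width: number of neighboring slices taken on each side
    """
    volume = np.asarray(volume)
    if volume.ndim != 3:
        raise ValueError("expected a volume of shape (depth, height, width)")
    depth = volume.shape[0]
    if half_width < 1 or depth < 2 * half_width + 1:
        raise ValueError("volume has too few slices for this half_width")

    inputs, targets = [], []
    # Only centers with a full set of neighbors on both sides are used.
    for center in range(half_width, depth - half_width):
        below = volume[center - half_width:center]
        above = volume[center + 1:center + half_width + 1]
        # Stack neighbors along the channel axis: shape (2 * half_width, H, W).
        inputs.append(np.concatenate([below, above], axis=0))
        targets.append(volume[center])

    return np.stack(inputs), np.stack(targets)
```

Under these assumptions, a volume with D slices yields D - 2 * half_width training pairs, each with 2 * half_width input channels. The rationale is that anatomy changes slowly between adjacent slices while the noise is largely independent from slice to slice, so a network trained on such pairs is pushed toward the shared structure rather than the noise.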