Numerous models for sleep stage scoring utilizing single-channel raw EEG signal have typically employed CNN and BiLSTM architectures. While these models, incorporating temporal information for sequence classification, demonstrate superior overall performance, they often exhibit low per-class performance for N1-stage, necessitating an adjustment of loss function. However, the efficacy of such adjustment is constrained by the training process. In this study, a pioneering training approach called separating training is introduced, alongside a novel model, to enhance performance. The developed model comprises 15 CNN models with varying loss function weights for feature extraction and 1 BiLSTM for sequence classification. Due to its architecture, this model cannot be trained using an end-to-end approach, necessitating separate training for each component using the Sleep-EDF dataset. Achieving an overall accuracy of 87.02%, MF1 of 82.09%, Kappa of 0.8221, and per-class F1-socres (W 90.34%, N1 54.23%, N2 89.53%, N3 88.96%, and REM 87.40%), our model demonstrates promising performance. Comparison with sleep technicians reveals a Kappa of 0.7015, indicating alignment with reference sleep stags. Additionally, cross-dataset validation and adaptation through training with the SHHS dataset yield an overall accuracy of 84.40%, MF1 of 74.96% and Kappa of 0.7785 when tested with the Sleep-EDF-13 dataset. These findings underscore the generalization potential in model architecture design facilitated by our novel training approach.