Over the past decade, the rapid development of deep learning (DL) based convolutional neural networks (CNNs) has greatly improved the performance of machine learning (ML) algorithms on semantic scene classification (SSC) of remote sensing images (RSI). However, attention has been unevenly split between classification accuracy and efficiency, so the advantages of DL-based algorithms, e.g., automation and simplicity, have been partially lost. Traditional ML strategies (e.g., handcrafted features or indicators) and accuracy-oriented strategies with a high computational cost (e.g., multi-stage CNNs and ensembles of multiple CNNs) are widely used without any training efficiency optimization, which may result in suboptimal performance. To address this problem, we propose a fast and simple training CNN framework (named FST-EfficientNet) for RSI-SSC based on the EfficientNet version 2 small (EfficientNetV2-S) CNN model. The entire pipeline is one-stage and end-to-end, with no handcrafted features or discriminators. Training efficiency optimization relies only on several routine data augmentation tricks coupled with either a fixed resolution ratio or a gradually increasing resolution strategy, so the added cost is very low. The performance evaluation shows that FST-EfficientNet sets new state-of-the-art (SOTA) results in overall accuracy (OA), about 0.8% to 2.7% ahead of all earlier methods on the Aerial Image Dataset (AID) and the Northwestern Polytechnical University Remote Sensing Image Scene Classification 45 Dataset (NWPU-RESISC45D). The results also demonstrate the importance of training efficiency optimization strategies for DL-based RSI-SSC: better classification accuracy does not have to come at an excessive cost in efficiency.
Ultimately, these findings are expected to contribute to the development of more efficient CNN-based approaches in RSI-SSC.
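As an illustration of the gradually increasing resolution strategy mentioned above, the following is a minimal sketch of what a per-epoch resolution schedule might look like. The function name, the linear ramp, the 128 to 256 pixel range, and the rounding to multiples of 32 are all assumptions made for this example; the abstract does not specify the schedule actually used in FST-EfficientNet.

```python
def progressive_resolution(epoch, total_epochs, min_res=128, max_res=256, multiple=32):
    """Return an image side length (pixels) for a given training epoch.

    Hypothetical schedule: ramp linearly from min_res to max_res over
    training, rounded to the nearest multiple of `multiple`.
    """
    if total_epochs < 1:
        raise ValueError("total_epochs must be at least 1")
    if total_epochs == 1:
        return max_res
    # Clamp the epoch index so out-of-range values stay on the schedule.
    epoch = max(0, min(epoch, total_epochs - 1))
    # Fraction of training completed, from 0.0 (first epoch) to 1.0 (last).
    frac = epoch / (total_epochs - 1)
    res = min_res + frac * (max_res - min_res)
    # Snap to a multiple of `multiple` so image sizes stay on a coarse grid.
    return int(round(res / multiple)) * multiple


# Example: resolutions a 10-epoch run would use under these assumptions.
schedule = [progressive_resolution(e, 10) for e in range(10)]
```

In a training loop, the returned value would typically be passed to the resize or random-crop step of the data augmentation pipeline at the start of each epoch, so early epochs train on smaller, cheaper images and later epochs on full-size ones.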