Accelerometer data collected from wearable devices have recently been used to monitor physical activities (PAs) in daily life. While the intensity of PAs can be distinguished with a cut-off approach, it is important to discriminate different behaviors with similar accelerometry patterns to estimate energy expenditure. We aim to overcome the data imbalance problem that negatively affects machine learning-based PA classification by extracting well-defined features and applying undersampling and oversampling methods. We extracted various temporal, spectral, and nonlinear features from wrist-, hip-, and ankle-worn accelerometer data. Then, the influences of undersampilng and oversampling were compared using various ML and DL approaches. Among various ML and DL models, ensemble methods including random forest (RF) and adaptive boosting (AdaBoost) exhibited great performance in differentiating sedentary behavior (driving) and three walking types (walking on level ground, ascending stairs, and descending stairs) even in a cross-subject paradigm. The undersampling approach, which has a low computational cost, exhibited classification results unbiased to the majority class. In addition, we found that RF could automatically select relevant features for PA classification depending on the sensor location by examining the importance of each node in multiple decision trees (DTs). This study proposes that ensemble learning using well-defined feature sets combined with the undersampling approach is robust for imbalanced datasets in PA classification. This approach will be useful for PA classification in the free-living situation, where data imbalance problems between classes are common.