Monitoring anthropogenic CO2 emissions, which raise the atmospheric CO2 concentration, is central to managing emission reduction and control. As the volume of satellite-based observation data related to carbon emissions grows rapidly, data-driven machine learning methods show great promise for predicting anthropogenic CO2 emissions. Because these emissions are spatially heterogeneous, the training samples used to fit a machine learning model play a key role in obtaining accurate predictions. We propose an approach that addresses this spatial heterogeneity by building training datasets from a clustering of the atmospheric CO2 concentration combined with a segmentation of emissions. We assessed three gradient boosting decision tree (GBDT) algorithms: LightGBM, XGBoost, and CatBoost. Using multiple parameters related to anthropogenic CO2-emitting activities as predictor variables, together with emission inventory data from 2019 to 2021, we compared and verified the accuracy and effectiveness of prediction models built from different training-dataset sampling methods combined with these algorithms. The CatBoost model trained on the dataset derived from the clustering and segmentation method achieved the best prediction accuracy and performance in revealing anthropogenic CO2 emissions. Because it relies on machine learning applied to observation data, this approach could serve as one emission monitoring tool that quickly provides up-to-date information on anthropogenic CO2 emissions.
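The sampling idea described above can be illustrated with a minimal sketch. It assumes each grid cell already carries a cluster label (from some clustering of the atmospheric CO2 concentration) and an inventory emission value. The function names, the record fields `cluster` and `emission`, the segment edges, and the per-stratum sample size are all hypothetical choices made for illustration and are not taken from the study. The sketch draws an equal number of cells from every (cluster, emission segment) stratum so that rare high-emission cells are not swamped by the many low-emission ones.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def emission_segment(value, edges):
    """Return the index of the emission segment that `value` falls in.

    `edges` is a sorted list of thresholds (hypothetical values).
    """
    return bisect_right(edges, value)


def stratified_training_sample(cells, edges, per_stratum, seed=0):
    """Draw up to `per_stratum` cells from every (cluster, segment) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for cell in cells:
        key = (cell["cluster"], emission_segment(cell["emission"], edges))
        strata[key].append(cell)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic grid cells with made-up values: many low emitters, few large ones.
cells = [{"id": i, "cluster": i % 2, "emission": 1.0} for i in range(90)]
cells += [{"id": 90 + i, "cluster": i % 2, "emission": 500.0} for i in range(10)]

sample = stratified_training_sample(cells, edges=[10.0, 100.0], per_stratum=5)
high = sum(1 for c in sample if c["emission"] > 100.0)
print(len(sample), high)
```

In this toy setup, high-emission cells make up 10% of the grid but half of the sample. A sample of this kind could then be passed to any of the GBDT libraries named above. How the study actually balances its strata is not specified here, so the equal-allocation rule is only one plausible choice.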