An open-source framework is presented for developing and evaluating machine learning (ML)-assisted data-driven models of carbon dioxide (CO2) enhanced oil recovery (EOR) processes that predict oil production and CO2 retention. The framework generated inputs and outputs for two cases of CO2 water-alternating-gas (WAG) injection using Python packages and a reservoir simulator; these data were subsequently used to train and test supervised learning algorithms. The main objective was to increase the speed, robustness, and accuracy of predicting oil recovery and CO2 retention using a fully open-source approach that combines Python programming, reservoir simulation, and ML techniques. The framework incorporated the reservoir model of the SPE5 benchmark study. The geometry was built using the pyopmnearwell Python package, and the simulations were run in the open-source Open Porous Media (OPM) Flow simulator. The permeability and porosity of the top layer and the gas injection rate were selected as variable input parameters to generate different scenarios, for which the cumulative oil recovery and CO2 retention were determined using the simulator. These inputs and outputs formed the training and test datasets for the ML models. Finally, the algorithms were optimized through hyperparameter tuning to improve the predictive scoring metrics, namely R-squared and root mean square error (RMSE).
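The workflow described above (sample the inputs, obtain the outputs, split into training and test sets, fit, and score) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the input ranges, the sample count, the 80/20 split, and above all `toy_simulator`, a made-up analytic function that stands in for the OPM Flow runs so that the example is self-contained. It is not the study's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
n_samples = 300  # hypothetical number of simulation jobs

# Hypothetical input ranges (not taken from the study): top-layer
# permeability, top-layer porosity, and gas injection rate.
permeability = rng.uniform(50.0, 500.0, n_samples)
porosity = rng.uniform(0.1, 0.3, n_samples)
gas_rate = rng.uniform(0.5, 2.0, n_samples)
X = np.column_stack([permeability, porosity, gas_rate])


def toy_simulator(inputs):
    """Placeholder for the reservoir simulator (assumption, not OPM Flow).

    Returns two made-up targets standing in for cumulative oil recovery
    and CO2 retention, one row per input sample.
    """
    k, phi, q = inputs[:, 0], inputs[:, 1], inputs[:, 2]
    oil_recovery = 0.3 + 0.2 * np.log10(k) / 3.0 + 0.5 * phi + 0.05 * q
    co2_retention = 0.8 - 0.1 * q + 0.4 * phi - 0.05 * np.log10(k)
    return np.column_stack([oil_recovery, co2_retention])


y = toy_simulator(X)

# Hold out 20% of the samples for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Random Forest handles both targets at once (multi-output regression).
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Scoring metrics named in the text, averaged over the two targets.
r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print(f"R-squared: {r2:.3f}, RMSE: {rmse:.4f}")
```

In an actual run, the call to `toy_simulator` would be replaced by the cumulative oil recovery and CO2 retention extracted from the simulator output for each sampled scenario.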
Two cases, each with one injector and one producer, were created to develop and evaluate the predictive models. Case 1 was constrained to a WAG injection period of 20 years with injection slugs of 3 months, and Case 2 aimed to maintain the same injected volumes for each simulation job. The reservoir simulator produced representative results, which were then used to generate a dataset for training, testing, and validating the ML algorithms. Based on the predictive scoring metrics, the Gradient Boosting regressor (GBR) and Random Forest (RF) regressor performed best, followed by the Decision Tree (DT) regressor. In contrast, the K-nearest neighbors regressor performed poorly. Furthermore, two hyperparameter tuning approaches were used to select hyperparameter values that improved the algorithms' RMSE or R-squared. The tuning determined the optimal model configuration (i.e., the best number of estimators or neighbors) to increase the accuracy or reduce the error of the predictions. The predictions of the data-driven models generated by DT, RF, and GBR were reliable based on the accuracy metrics of the trained and tested models, with R-squared values above 0.93 and RMSE values below 0.05. This robust approach provides a tool for predicting oil recovery and CO2 retention and for assessing the parameter sensitivity of CO2 EOR projects, with improved accuracy and speed compared to existing methods.
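The comparison of the four regressors and the tuning of the number of estimators or neighbors can be sketched as follows. The text does not name the two tuning approaches; this sketch assumes an exhaustive grid search and a randomized search from scikit-learn purely for illustration. The synthetic data, the parameter grids, and the three-fold cross-validation are likewise assumptions, and a single made-up target is used because scikit-learn's Gradient Boosting regressor fits one target at a time.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    train_test_split,
)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: three scaled inputs and one made-up target.
rng = np.random.default_rng(seed=1)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 0.4 + 0.3 * X[:, 0] + 0.2 * X[:, 1] ** 2 - 0.1 * X[:, 2]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Each candidate pairs an estimator with a small, hypothetical grid.
candidates = {
    "DT": (DecisionTreeRegressor(random_state=1), {"max_depth": [3, 5, None]}),
    "RF": (RandomForestRegressor(random_state=1), {"n_estimators": [50, 100]}),
    "GBR": (GradientBoostingRegressor(random_state=1), {"n_estimators": [50, 100]}),
    "KNN": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 9]}),
}

# Assumed approach 1: exhaustive grid search, scored by cross-validated RMSE.
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(
        estimator, grid, scoring="neg_root_mean_squared_error", cv=3
    )
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_rmse": -search.best_score_,  # scorer returns negated RMSE
        "test_r2": search.best_estimator_.score(X_test, y_test),
    }

# Assumed approach 2: randomized search over a larger space, shown for RF.
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=1),
    {"n_estimators": [25, 50, 100, 150], "max_depth": [3, 5, 8, None]},
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=1,
)
random_search.fit(X_train, y_train)

for name, res in results.items():
    print(name, res["best_params"],
          f"CV RMSE={res['cv_rmse']:.4f}", f"test R2={res['test_r2']:.3f}")
print("Randomized search (RF):", random_search.best_params_)
```

The ranking produced on this synthetic data carries no information about the study's results; the sketch only shows how cross-validated RMSE and held-out R-squared can be collected per algorithm for comparison.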