Summary
In this study, we developed a data mining-based multivariate analysis (MVA) workflow to identify correlations in small, complex, high-dimensional data sets. The research was motivated by the integrated analysis of geologic, geophysical, completion, and production data from a 4-square-mile study field located in the Northern Denver-Julesburg (DJ) Basin, Colorado, USA. The goal was to establish a workflow that can extract learnings from a small data set to guide the future development of surrounding acreage. The proposed MVA workflow is based on the random forest algorithm, with significant modifications, and is assessed using the R² score from K-fold cross-validation (CV). It performs significantly better on small data sets than traditional feature selection methods because it (1) selects the top-performing feature combinations at each step, (2) embeds iterations, (3) avoids random correlations, and (4) summarizes each feature’s occurrence at the end. When the MVA workflow was first applied to a complex synthetic small data set that included numerical and categorical variables, linear and nonlinear relationships, relationships among the independent variables, and high dimensionality, it correctly identified all correlating variables and outperformed traditional feature selection methods. The workflow was then applied to a field data set of 23 wells to identify the key factors that affect production performance in the study area. It revealed a weak correlation between production and the legacy well effect. The results show that the key factors affecting production in this study area are total organic carbon (TOC) percentage, open fracture densities, clay content, and legacy well effect, which should receive significant attention when developing neighboring acreage in the DJ Basin.
More importantly, this MVA method can be implemented in other basins. Considering the heterogeneity of unconventional resources, it is worthwhile to identify the key production drivers on a small scale. The superior performance of this MVA method on small data sets makes it possible to provide valuable insights for each specific acreage.
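The four workflow elements listed in the summary can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices in it are assumptions. A simple k-nearest-neighbor regressor stands in for the modified random forest so that the example needs only the Python standard library. The parameters `beam_width`, `max_size`, and `n_iterations` are hypothetical names. Repeating the search under different CV shuffles is only one possible way to guard against random correlations. Categorical variables are not handled, and the data are synthetic.

```python
import random
from collections import Counter


def r2_score(y_true, y_pred):
    """Coefficient of determination; returns 0.0 if y_true has no variance."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def knn_predict(X_train, y_train, x, features, k=3):
    """Stand-in model (assumption): mean target of the k nearest training rows."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), yt)
        for row, yt in zip(X_train, y_train)
    )
    nearest = dists[:k]
    return sum(yt for _, yt in nearest) / len(nearest)


def cv_r2(X, y, features, n_splits, rng):
    """R^2 of pooled out-of-fold predictions from a shuffled K-fold split."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    preds = [0.0] * len(X)
    for fold in range(n_splits):
        test = set(idx[fold::n_splits])
        X_tr = [X[i] for i in idx if i not in test]
        y_tr = [y[i] for i in idx if i not in test]
        for i in test:
            preds[i] = knn_predict(X_tr, y_tr, X[i], features)
    return r2_score(y, preds)


def select_features(X, y, n_iterations=5, beam_width=3, max_size=3,
                    n_splits=5, seed=0):
    """Count how often each feature index appears in the best combination."""
    rng = random.Random(seed)
    n_features = len(X[0])
    counts = Counter()
    for _ in range(n_iterations):              # (2) embedded iterations
        beam = [()]
        best_score, best_combo = float("-inf"), ()
        for _size in range(max_size):
            # Grow every retained combination by one unused feature.
            candidates = sorted({
                tuple(sorted(combo + (f,)))
                for combo in beam
                for f in range(n_features) if f not in combo
            })
            if not candidates:
                break
            scored = sorted(
                ((cv_r2(X, y, c, n_splits, rng), c) for c in candidates),
                reverse=True,
            )
            # (1) keep only the top-performing combinations at this step
            beam = [c for _, c in scored[:beam_width]]
            if scored[0][0] > best_score:
                best_score, best_combo = scored[0]
        counts.update(best_combo)              # (4) summarize occurrences
    return counts


def make_demo_data(n_rows=40, n_features=6, seed=1):
    """Synthetic data: only feature 0 drives the target, the rest are noise."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_rows)]
    y = [5.0 * row[0] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


X, y = make_demo_data()
counts = select_features(X, y)
print(counts.most_common())
```

Features that recur across iterations are less likely to owe their score to a lucky split, which is the intuition behind counting occurrences here. To move closer to the workflow described above, the stand-in model could be replaced by a random forest regressor, for example scikit-learn's `RandomForestRegressor` scored with `cross_val_score`.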