This chapter summarizes the practical experience of integrating genetic programming and statistical modeling at The Dow Chemical Company. A unique methodology for using genetic programming in statistical modeling of designed and undesigned data is described and illustrated with successful industrial applications. As a result of these synergistic efforts, the model building technique has been improved, and model development cost and time can be significantly reduced. In the case of designed data, genetic programming reduced costs by suggesting transformations as an alternative to additional experimentation. In the case of undesigned data, genetic programming was instrumental in reducing model building costs by providing alternative models for consideration.
In the last few years, high-throughput reactors have received significant attention due to the potential they offer for fast material development. While many experimental design techniques have been proposed, statistical issues related to experimentation in this type of equipment are still emerging. One of the experimental design techniques needed is the split-plot approach, given the randomization restrictions imposed by the equipment. This paper presents the use of split-plot experimental designs in a high-throughput reactor. We discuss the unique error structure of these designs and the special statistical analysis that accounts for two different types of errors. A case study at The Dow Chemical Company is presented. The main advantage of the split-plot approach for high-throughput work is that reactor-well utilization can be maximized while randomization restrictions are handled correctly at the same time. The results obtained indicate the success of this strategy in maximizing the chance of detecting a lead and drawing the right conclusions, which is of key importance given the speed of data generation in high-throughput reactors.
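The randomization restriction behind a split-plot design can be illustrated with a minimal sketch. It assumes one hard-to-change factor that is set once per reactor run (the whole plot) and one easy-to-change factor that varies across wells within a run (the sub-plot). The function name `split_plot_layout`, the factor levels, and the seed are hypothetical and are not taken from the case study.

```python
import random


def split_plot_layout(whole_plot_levels, sub_plot_levels, replicates=1, seed=None):
    """Return a split-plot run order as (whole_plot_id, whole_level, sub_level) rows.

    Randomization happens at two levels, which is why the analysis
    later needs two error terms:
      1. the order of the whole plots (e.g. reactor runs) is shuffled;
      2. within each whole plot, the sub-plot levels (e.g. wells) are shuffled.
    """
    rng = random.Random(seed)

    # Each whole-plot level is repeated `replicates` times, then the run order is shuffled.
    whole_plots = [w for w in whole_plot_levels for _ in range(replicates)]
    rng.shuffle(whole_plots)

    layout = []
    for wp_id, whole_level in enumerate(whole_plots):
        # Sub-plot treatments are randomized separately inside every whole plot.
        subs = list(sub_plot_levels)
        rng.shuffle(subs)
        for sub_level in subs:
            layout.append((wp_id, whole_level, sub_level))
    return layout


if __name__ == "__main__":
    # Hypothetical example: temperature is fixed per run, catalyst varies by well.
    for row in split_plot_layout([150, 200], ["A", "B", "C"], replicates=2, seed=0):
        print(row)
```

Because the whole-plot factor is reset only once per run, its effect is judged against run-to-run variation, while the sub-plot factor is judged against well-to-well variation within a run. This is the two-error structure the abstract refers to.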
Symbolic regression based on Pareto Front GP is the key approach for generating high-performance, parsimonious empirical models acceptable for industrial applications. The paper addresses the issue of finding the optimal parameter settings of Pareto Front GP that direct the simulated evolution toward simple models with acceptable prediction error. A generic methodology based on statistical design of experiments is proposed. It includes statistical determination of the number of replicates by half-width confidence intervals, determination of the significant inputs by fractional factorial design of experiments, approaching the optimum by steepest ascent/descent, and local exploration around the optimum by Box-Behnken or central composite design of experiments. The results from applying the proposed methodology to a small-sized industrial data set show that the statistically significant factors for symbolic regression based on Pareto Front GP are the number of cascades, the number of generations, and the population size. A second-order regression model in these three parameters achieves a high R² of 0.97, and their optimal values have been determined. The optimal parameter settings were validated with a separate small-sized industrial data set. The optimal settings are recommended for symbolic regression applications using data sets with up to 5 inputs and up to 50 data points.
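The first step of the methodology, choosing the number of replicates from a confidence-interval half-width, can be sketched as follows. This is only an illustration under stated assumptions: it uses a normal approximation in place of the Student t quantile, which understates the half-width for small samples, and the function names and pilot values are hypothetical rather than taken from the paper.

```python
import math
from statistics import NormalDist, stdev


def half_width(samples, confidence=0.95):
    """Half-width of a confidence interval for the mean of `samples`.

    Assumption: normal approximation, h = z * s / sqrt(n).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * stdev(samples) / math.sqrt(len(samples))


def replicates_needed(pilot, target_half_width, confidence=0.95):
    """Smallest n with z * s / sqrt(n) <= target, using s from pilot runs."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    s = stdev(pilot)
    return max(2, math.ceil((z * s / target_half_width) ** 2))


if __name__ == "__main__":
    # Hypothetical prediction errors from five pilot GP runs at one setting.
    pilot_errors = [0.10, 0.12, 0.11, 0.13, 0.09]
    print(half_width(pilot_errors))
    print(replicates_needed(pilot_errors, target_half_width=0.01))
```

Since GP runs are stochastic, repeating each parameter setting this many times keeps the mean response estimate within the chosen tolerance before the fractional factorial screening is analyzed.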