Despite their growing popularity among neural network practitioners, ensemble methods have not been widely adopted in structure-activity and structure-property correlation. Neural networks are inherently unstable, in that small changes in the training set and/or training parameters can lead to large changes in their generalization performance. Recent research has shown that by capitalizing on the diversity of the individual models, ensemble techniques can minimize uncertainty and produce more stable and accurate predictors. In this work, we present a critical assessment of the most common ensemble technique, bootstrap aggregation (bagging), as applied to quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) modeling. Although aggregation does offer definite advantages, we demonstrate that bagging may not be the best possible choice and that simpler techniques such as retraining with the full sample can often produce superior results. These findings are rationalized using Krogh and Vedelsby's decomposition of the generalization error into a term that measures the average generalization performance of the individual networks and a term that measures the diversity among them. For networks that are designed to resist over-fitting, the benefits of aggregation are clear but not overwhelming.
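The decomposition mentioned above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a uniformly weighted ensemble under squared error, for which the ensemble error equals the average member error minus the average spread of the members around the ensemble output (the diversity, or ambiguity, term). The function name `ambiguity_decomposition` and the synthetic data are illustrative assumptions.

```python
import random


def ambiguity_decomposition(preds, targets):
    """Return (ensemble_error, average_member_error, ambiguity).

    Assumes a uniformly weighted ensemble and squared error.
    preds[i][k] is the prediction of ensemble member i on sample k.
    """
    n_models = len(preds)
    n_samples = len(targets)
    ensemble_error = average_error = ambiguity = 0.0
    for k in range(n_samples):
        member_outputs = [preds[i][k] for i in range(n_models)]
        ensemble_output = sum(member_outputs) / n_models
        # Squared error of the aggregated prediction.
        ensemble_error += (ensemble_output - targets[k]) ** 2
        # Mean squared error of the individual members.
        average_error += sum((f - targets[k]) ** 2 for f in member_outputs) / n_models
        # Mean squared deviation of the members from the ensemble output.
        ambiguity += sum((f - ensemble_output) ** 2 for f in member_outputs) / n_models
    return (ensemble_error / n_samples,
            average_error / n_samples,
            ambiguity / n_samples)


# Synthetic stand-in for ten trained networks: the targets plus independent noise.
rng = random.Random(0)
targets = [rng.gauss(0.0, 1.0) for _ in range(50)]
preds = [[t + rng.gauss(0.0, 0.5) for t in targets] for _ in range(10)]

ens_err, avg_err, amb = ambiguity_decomposition(preds, targets)
print(f"ensemble error        = {ens_err:.4f}")
print(f"average member error  = {avg_err:.4f}")
print(f"ambiguity (diversity) = {amb:.4f}")
```

Because the ambiguity term is never negative, the ensemble error cannot exceed the average member error. Aggregation therefore helps most when the members disagree substantially without being much worse individually, which is consistent with the modest gains reported for networks that already resist over-fitting.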
We present a new feature selection algorithm for structure-activity and structure-property correlation based on particle swarms. Particle swarms explore the search space through a population of individuals that adapt by returning stochastically toward previously successful regions, influenced by the success of their neighbors. This method, which was originally intended for searching multidimensional continuous spaces, is adapted to the problem of feature selection by viewing the location vectors of the particles as probabilities and employing roulette wheel selection to construct candidate subsets. The algorithm is applied in the construction of parsimonious quantitative structure-activity relationship (QSAR) models based on feed-forward neural networks and is tested on three classical data sets from the QSAR literature. It is shown that the method compares favorably with simulated annealing and is able to identify a better and more diverse set of solutions given the same amount of simulation time.
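The subset-construction step described above can be sketched as follows. This is an illustration under stated assumptions rather than the published algorithm: each component of a particle's location vector is used directly as a selection weight, negative components are clipped to zero, the subset size is fixed in advance, and features are drawn without replacement. The name `roulette_select_features` and the example vector are hypothetical.

```python
import random


def roulette_select_features(position, subset_size, rng):
    """Draw `subset_size` distinct feature indices by roulette wheel selection.

    Each feature is picked with probability proportional to its
    (non-negative) component of `position`; chosen features are removed
    from the wheel before the next spin.
    """
    if not 0 < subset_size <= len(position):
        raise ValueError("subset_size must be between 1 and len(position)")
    # Assumption: negative components carry no selection weight.
    weights = [max(p, 0.0) for p in position]
    available = list(range(len(position)))
    chosen = []
    for _ in range(subset_size):
        total = sum(weights[i] for i in available)
        if total <= 0.0:
            # Every remaining weight is zero: fall back to a uniform draw.
            pick = rng.choice(available)
        else:
            spin = rng.random() * total  # point on the wheel, in [0, total)
            cumulative = 0.0
            pick = available[-1]  # guard against floating-point round-off
            for i in available:
                cumulative += weights[i]
                if spin < cumulative:
                    pick = i
                    break
        chosen.append(pick)
        available.remove(pick)
    return sorted(chosen)


# Hypothetical location vector for one particle over five candidate descriptors.
position = [0.9, 0.05, 0.6, 0.0, 0.3]
rng = random.Random(42)
subset = roulette_select_features(position, 3, rng)
print(subset)
```

In a full feature selection loop, each candidate subset would be scored by training a model on the selected descriptors, and the score would feed the usual particle swarm update of the location vectors. That update is omitted here.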