Several pool-based active learning algorithms have been employed to model potential energy surfaces (PESs) with a minimum number of electronic structure calculations. Among these algorithms, the class of uncertainty-based algorithms is popular. Their key principle is to query the molecular structures for which the model's predictions have high uncertainty. We empirically show that this strategy is not optimal for nonuniform data distributions, as it collects many structures from sparsely sampled regions, which are less important to applications of the PES. We exploit a simple stochastic algorithm to correct for this behavior and implement it using regression trees, which have relatively small computational costs. We show that this algorithm requires around half the data to converge to the same accuracy as the uncertainty-based algorithm query-by-committee. Simulations on a 6D PES of pyrrole(H₂O) show that fewer than 15 000 configurations are enough to build a PES with a generalization error of 16 cm⁻¹, whereas the final model with around 50 000 configurations has a generalization error of 11 cm⁻¹.
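To make the contrast between the two query strategies concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that the uncertainty of a pool structure is measured as the standard deviation of the predictions of a committee of models, and it assumes one possible stochastic variant in which structures are drawn with probability proportional to that uncertainty. The function names `committee_disagreement`, `query_top_k`, and `query_stochastic` are hypothetical and introduced only for illustration.

```python
import numpy as np


def committee_disagreement(predictions):
    """Per-structure uncertainty as the spread of committee predictions.

    predictions: array of shape (n_models, n_pool), one row per committee member.
    Returns an array of shape (n_pool,) with the standard deviation per structure.
    """
    return np.std(np.asarray(predictions, dtype=float), axis=0)


def query_top_k(uncertainty, k):
    """Deterministic query: indices of the k most uncertain pool structures."""
    return np.argsort(uncertainty)[::-1][:k]


def query_stochastic(uncertainty, k, rng):
    """Assumed stochastic query: draw k distinct structures with probability
    proportional to their uncertainty (an illustrative choice, not the paper's rule)."""
    u = np.asarray(uncertainty, dtype=float) + 1e-12  # keep every probability nonzero
    p = u / u.sum()
    return rng.choice(len(u), size=k, replace=False, p=p)


# Toy pool of four structures, committee of three models (made-up numbers).
preds = np.array([[0.0, 1.0, 2.0, 3.0],
                  [0.0, 1.0, 2.0, 5.0],
                  [0.0, 1.0, 2.0, 7.0]])
u = committee_disagreement(preds)
top = query_top_k(u, 1)
drawn = query_stochastic(u, 2, np.random.default_rng(0))
```

In this sketch the deterministic query always returns the structures with the largest disagreement, which in a nonuniform pool tend to lie in sparsely sampled regions, whereas the stochastic query spreads its selections over the pool in proportion to the uncertainty.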