Traditional machine learning (ML) metrics overestimate model performance for materials discovery. We introduce (1) leave-one-cluster-out cross-validation (LOCO CV) and (2) a simple nearest-neighbor benchmark to show that model performance in discovery applications strongly depends on the problem, data sampling, and extrapolation. Our results suggest that ML-guided iterative experimentation may outperform standard high-throughput screening for discovering breakthrough materials such as high-Tc superconductors.

Introduction

Materials informatics (MI), or the application of data-driven algorithms to materials problems, has grown quickly as a field in recent years.9 Across all of these applications, a training database of simulated or experimentally-measured materials properties serves as input to a ML algorithm that predictively maps features (i.e., materials descriptors) to target materials properties. Ideally, the result of training such models would be the experimental realization of new materials with promising properties. The MI community has produced several such success stories, including thermoelectric compounds,10,11 shape-memory alloys,12 superalloys,13 and 3D-printable high-strength aluminum alloys.14 However, in many cases, a model is itself the output of a study, and the question becomes: to what extent could the model be used to drive materials discovery?

Typically, the performance of ML models of materials properties is quantified via cross-validation (CV). CV can be performed either as a single division of the available data into a training set (to build the model) and a test set (to evaluate its performance), or as an ensemble process known as k-fold CV, wherein the data are partitioned into k non-overlapping subsets of nearly equal size (folds) and model performance is averaged across each combination of k-1 training folds and one test fold. Leave-one-out cross-validation (LOOCV) is the limit where k equals the total number of examples in the dataset. Table 1 summarizes some examples of model performance statistics as reported in the aforementioned studies (some studies involved testing multiple algorithms across multiple properties).

In Table 1, the reported model performance is uniformly excellent across all studies. A tempting conclusion is that any of these models could be used for one-shot high-throughput screening of large numbers of materials for desired properties. However, as we discuss below, traditional CV has critical shortcomings in terms of quantifying ML model performance for materials discovery.
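To make the contrast between these evaluation schemes concrete, the sketch below compares random k-fold CV with a LOCO CV split in a scikit-learn-style workflow. The synthetic data, the KMeans clustering, and the random-forest regressor are illustrative assumptions for this sketch, not the featurization, clustering method, or models used in the studies cited above.

# Sketch: random k-fold CV vs. a leave-one-cluster-out (LOCO) split.
# Assumptions: synthetic descriptors stand in for real materials features,
# KMeans defines the held-out clusters, and a random forest is the model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                        # stand-in materials descriptors
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)    # stand-in target property

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Random k-fold CV: test points are drawn from the same distribution as the
# training points, so the model mostly interpolates. (LeaveOneOut() is the
# k = n limit of KFold.)
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# LOCO CV: cluster the feature space and hold out one whole cluster at a time,
# so the model must extrapolate to an unseen region of materials space.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
loco_scores = cross_val_score(
    model, X, y, groups=clusters, cv=LeaveOneGroupOut())

print("random k-fold R^2:", kfold_scores.mean())
print("LOCO CV R^2:      ", loco_scores.mean())

A lower LOCO score relative to the random k-fold score would indicate exactly the extrapolation gap that, as argued below, traditional CV tends to hide.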
Issues with traditional cross-validation for materials discovery

Many ML benchmark problems consist of data classification into discrete bins, i.e., pattern matching. For example, the
Design, System, Application

Machine learning (ML) has become a widely-adopted predictive tool for materials design and discovery. Random k-fold cross-validation (CV), the traditional gold-standard approach for evaluating the quality of ML models, is fundamentally mismatched to the nature of materials discovery, and leads to ...