The discovery of biocompatible or bioactive nanoparticles for medicinal applications is an expensive and time-consuming process that could be significantly facilitated by more rational approaches combining experimental and computational methods. However, it is currently hindered by two limitations: (1) the lack of high-quality, comprehensive data for computational modeling and (2) the lack of an effective modeling method for complex nanomaterial structures. In this study, we tackled both issues by first synthesizing a large library of nanoparticles and obtaining comprehensive data on their characterization and bioactivities. In parallel, we virtually simulated each individual nanoparticle in this library by calculating its nanostructural characteristics and built models that correlate nanostructure diversity with the corresponding biological activities. The resulting models were then used to predict and design nanoparticles with desired bioactivities. Experimental testing of the designed nanoparticles yielded results consistent with the model predictions. These findings demonstrate that rational design approaches combining high-quality nanoparticle libraries, large experimental data sets, and intelligent computational models can significantly reduce the effort and cost of nanomaterial discovery.
Numerous chemical data sets have become available for quantitative structure–activity relationship (QSAR) modeling studies. However, the quality of different data sources may vary depending on the nature of the experimental protocols, and potential experimental errors in the modeling sets can lead to poor QSAR models and, in turn, unreliable predictions for new compounds. In this study, we explored the relationship between QSAR modeling performance and the ratio of questionable data in the modeling sets, which we obtained by simulating experimental errors. To this end, we used eight data sets (four with continuous endpoints and four with categorical endpoints) that had been extensively curated both in-house and by our collaborators to create over 1,800 QSAR models. Each data set was duplicated to create several new modeling sets with different ratios of simulated experimental errors (i.e., randomized activities for a fraction of the compounds). A fivefold cross-validation procedure was used to evaluate modeling performance, which deteriorated as the ratio of experimental errors increased. All of the resulting models were also used to predict external sets of new compounds that had been set aside at the beginning of the modeling process. The results showed that compounds with relatively large prediction errors in cross-validation were likely to be those with simulated experimental errors. However, after removing a certain number of compounds with large prediction errors in the cross-validation process, the external predictions for new compounds did not improve. We conclude that QSAR predictions, especially consensus predictions, can identify compounds with potential experimental errors, but that removing those compounds via the cross-validation procedure is not a reasonable means of improving model predictivity, owing to overfitting.
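The error-simulation protocol described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the data set, model choice (a random forest from scikit-learn), and error ratios are assumed for demonstration. The key step is randomizing the activities of a chosen fraction of the modeling-set compounds and observing how fivefold cross-validation performance degrades.

```python
# Sketch of simulating experimental errors in a QSAR modeling set:
# randomize the activities of a fraction of compounds, then measure
# fivefold cross-validation performance at each error ratio.
# Dataset and learner are illustrative stand-ins, not the study's own.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Surrogate for a curated categorical-endpoint modeling set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def simulate_errors(y, error_ratio, rng):
    """Return a copy of y with a given fraction of activities randomized."""
    y_noisy = y.copy()
    n_err = int(error_ratio * len(y))
    idx = rng.choice(len(y), size=n_err, replace=False)
    y_noisy[idx] = rng.integers(0, 2, size=n_err)  # random binary activity
    return y_noisy

for ratio in (0.0, 0.1, 0.2, 0.4):
    y_noisy = simulate_errors(y, ratio, rng)
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X, y_noisy, cv=5)
    print(f"error ratio {ratio:.0%}: mean CV accuracy {scores.mean():.3f}")
```

As the abstract reports, mean cross-validation accuracy drops as the simulated error ratio grows, since a larger share of the labels the model is evaluated against no longer reflect the true activities.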