Feature representations, or descriptors, are the chemical language of machines, and they largely shape the predictive capability, generalizability, and interpretability of machine learning models. A generally applicable descriptor is highly desirable for chemists handling conventional prediction tasks on small, sparsely distributed datasets. Inspired by the chemist's view of molecules, we present herein an ensemble descriptor, SPOC, designed on the principles of physical organic chemistry, that integrates the Structure and Physicochemical property (SPOC) of a molecule. SPOC can be readily constructed by combining molecular fingerprints, which represent the structure of a given molecule, with molecular physicochemical properties extracted from RDKit or Mordred molecular descriptors. The applicability of SPOC was thoroughly surveyed across a range of well-structured chemical databases, with machine learning tasks ranging from regression to classification.
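The construction described above, a structural fingerprint combined with physicochemical properties, can be sketched as a simple vector concatenation. This is a minimal illustration only, not the authors' implementation: `toy_fingerprint` is a hypothetical stand-in that hashes SMILES substrings so the example runs with the standard library alone, whereas the abstract refers to real molecular fingerprints and to properties taken from RDKit or Mordred. The function names, the 64-bit length, and the choice of three properties are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> List[int]:
    """Hash each SMILES substring of length 1..max_len into a fixed-length bit vector.

    Hypothetical stand-in for a real structural fingerprint; it only
    illustrates the "structure" half of the ensemble descriptor.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike the built-in hash().
            bits[zlib.crc32(fragment.encode("utf-8")) % n_bits] = 1
    return bits


def spoc_like_vector(
    smiles: str,
    properties: Dict[str, float],
    property_names: List[str],
    n_bits: int = 64,
) -> List[float]:
    """Concatenate structure bits with physicochemical property values."""
    structure_part = [float(bit) for bit in toy_fingerprint(smiles, n_bits)]
    # A fixed name order keeps the columns aligned across molecules.
    property_part = [float(properties[name]) for name in property_names]
    return structure_part + property_part


# Ethanol (SMILES "CCO"); in practice these values would be computed
# by a descriptor package rather than typed in by hand.
names = ["mol_weight", "h_bond_donors", "h_bond_acceptors"]
ethanol_properties = {"mol_weight": 46.07, "h_bond_donors": 1, "h_bond_acceptors": 1}
vector = spoc_like_vector("CCO", ethanol_properties, names)
print(len(vector), vector[-3:])
```

The resulting vector has one block of structure bits followed by one block of property values, so any standard regression or classification model can consume it directly. Because the two blocks live on different scales, scaling the property block before training may be worthwhile for some model families.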
Nucleophilicity and electrophilicity dictate the reactivity of polar organic reactions. In the past decades, Mayr et al. established a quantitative scale for nucleophilicity (N) and electrophilicity (E), which has proved to be a useful tool for the rationalization of chemical reactivity. In this study, a holistic prediction model was developed through a machine-learning approach. rSPOC, an ensemble molecular representation with structural, physicochemical, and solvent features, was developed for this purpose. With 1115 nucleophiles, 285 electrophiles, and 22 solvents, the dataset is currently the largest one for reactivity prediction. The rSPOC model trained with the Extra Trees algorithm showed high accuracy in predicting Mayr's N and E parameters, with R² of 0.92 and 0.93 and MAE of 1.45 and 1.45, respectively. Furthermore, practical applications of the model, for instance nucleophilicity prediction for NADH, NADPH, and a series of enamines, showed its potential for predicting the reactivity of uncharacterized molecules within seconds. An online prediction platform (http://isyn.luoszgroup.com/) was constructed based on the current model and is freely available to the scientific community.
Nucleophilicity and electrophilicity dictate the reactivity of polar organic reactions. In the past decades, Mayr et al. established a quantitative scale for nucleophilicity (N) and electrophilicity (E), which has proved to be a useful tool for the rationalization of chemical reactivity. In this study, a holistic prediction model was developed through a machine-learning approach. rSPOC, an ensemble molecular representation with structural, physicochemical, and solvent features, was developed for this purpose. With 1115 nucleophiles, 285 electrophiles, and 22 solvents, the dataset is currently the largest one for reactivity prediction. The rSPOC model trained with the Extra Trees algorithm showed high accuracy in predicting Mayr's N and E parameters, with R² of 0.96 and 0.92 and MAE of 0.99 and 1.47, respectively. Furthermore, practical applications of the model, for instance nucleophilicity prediction for NAD(P)H and a series of enamines, showed its potential for predicting the reactivity of uncharacterized molecules within seconds. An online prediction platform (http://isyn.luoszgroup.com/) was constructed based on the current model and is freely available to the scientific community.
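For readers less familiar with the two accuracy figures quoted above, the sketch below shows how they are conventionally computed, assuming R² denotes the usual coefficient of determination and MAE the mean absolute error. The helper names and the four value pairs are invented for illustration and are not taken from the Mayr dataset or from the rSPOC model.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up reference and predicted parameters, for illustration only.
reference = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

mae = mean_absolute_error(reference, predicted)
r2 = r_squared(reference, predicted)
print(f"MAE = {mae:.2f}, R2 = {r2:.2f}")
```

MAE is expressed in the same units as the predicted parameter, so an MAE near 1 means predictions are off by about one unit of N or E on average, while R² is unitless and measures the fraction of variance in the reference values that the predictions explain.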