Machine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model's capacity to interpolate. Testing errors from random splits may be overly optimistic when models are given new data that falls outside the scope of the training set; thus, there is a growing need to easily measure performance on extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. Separate from astartes, users can then use these splits to better assess out-of-sample performance with any ML model of choice. This publication focuses on use cases within cheminformatics; however, astartes operates on arbitrary vector inputs, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package managers pip and conda and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
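
To illustrate the intended workflow, the following is a minimal sketch of how a partition might be requested, assuming astartes exposes a train_test_split interface mirroring scikit-learn's with a sampler keyword (the data here is randomly generated for illustration; consult the package documentation for the exact signature and available samplers):

```python
import numpy as np
from astartes import train_test_split

# Hypothetical feature matrix and targets, e.g. molecular descriptors.
X = np.random.rand(100, 16)
y = np.random.rand(100)

# The `sampler` keyword selects the partitioning algorithm
# (here, Kennard-Stone) in place of a purely random assignment.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    sampler="kennard_stone",
    train_size=0.75,
)
```

The resulting subsets can then be passed to any downstream ML model, independent of astartes, to estimate out-of-sample performance under the chosen splitting strategy.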