Predicting the properties of a chemical molecule is of great importance in many applications, including drug discovery and material design. Machine learning-based models promise more accurate and faster molecular property predictions than current state-of-the-art techniques, such as Density Functional Theory calculations or wet-lab experiments. Various supervised machine learning models, including graph neural networks, have demonstrated promising performance on molecular property prediction tasks. However, the vast chemical space and the limited availability of property labels make supervised learning challenging, calling for the learning of a general-purpose molecular representation. Recently, unsupervised transformer-based language models pre-trained on large unlabeled corpora have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. The model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several classification and regression tasks from ten benchmark datasets, while performing competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
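To make the architectural ingredients mentioned above concrete, the following is a minimal, illustrative sketch of rotary positional embeddings applied to query and key tensors before attention; it follows the standard RoFormer-style formulation and is not taken from the MoLFormer codebase (the shapes, function name, and base frequency are assumptions for illustration only).

```python
import torch

def rotary_embedding(x, base=10000.0):
    """Apply rotary positional embeddings to a tensor of shape
    (seq_len, num_heads, head_dim). Pairs of channels are rotated by a
    position-dependent angle, so relative positions between tokens are
    encoded directly in the query-key dot products."""
    seq_len, _, head_dim = x.shape
    # One frequency per channel pair, geometrically spaced as in RoFormer.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.einsum("s,d->sd", positions, inv_freq)  # (seq_len, head_dim/2)
    cos = angles.cos()[:, None, :]  # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]  # even / odd channels form 2-D pairs
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Toy usage: rotate queries and keys before (linear) attention.
q = torch.randn(16, 8, 64)  # (tokens, heads, head_dim)
k = torch.randn(16, 8, 64)
q_rot, k_rot = rotary_embedding(q), rotary_embedding(k)
```

Because the rotation acts on queries and keys rather than on the attention matrix itself, it composes naturally with linear attention, where the full token-by-token attention matrix is never materialized.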
Main
Machine Learning (ML) has emerged as an appealing, computationally efficient approach for predicting molecular properties, with implications in drug discovery and material engineering. ML models for molecules can be trained directly on pre-defined chemical descriptors, such as unsupervised molecular fingerprints 1, or on hand-derived derivatives of geometric features, such as a Coulomb Matrix (CM) 2. More recent ML models, however, have focused on automatically learning the features, either from the natural graphs that encode connectivity information or from line annotations of molecular structures, such as the popular SMILES 3 (Simplified Molecular-Input Line Entry System) representation. SMILES defines a character string representation of a molecule by performing a depth-first pre-order spanning-tree traversal of the molecular graph, generating symbols for each atom, bond, tree-traversal decision, and broken cycle. The resulting character string therefore corresponds to a flattening of a spanning tree of the molecular graph, as illustrated in the short sketch below. Learning on SMIL...
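For illustration, the following minimal RDKit sketch shows the correspondence between a SMILES string and the molecular graph it flattens; the example molecule (aspirin) and the parameter choices are arbitrary and are not drawn from the original text.

```python
from rdkit import Chem  # RDKit is assumed to be available

# Aspirin as a SMILES string: ring closures are written with matching digits
# (the "1"s), branches with parentheses, and aromatic atoms in lower case.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

# Parsing recovers the molecular graph that the string flattens.
mol = Chem.MolFromSmiles(smiles)
print(mol.GetNumAtoms(), "heavy atoms")
for bond in mol.GetBonds():
    print(bond.GetBeginAtom().GetSymbol(), "-",
          bond.GetEndAtom().GetSymbol(), bond.GetBondType())

# Writing the graph back out yields a (canonical) SMILES string; starting the
# traversal from a different atom gives a different but equivalent string.
print(Chem.MolToSmiles(mol))
print(Chem.MolToSmiles(mol, rootedAtAtom=3))
```

The last two lines highlight that a single molecular graph admits many valid SMILES strings, one per traversal order, which is one reason sequence models trained on SMILES must learn the underlying graph structure rather than surface string patterns.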