Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier

Handsel, Jennifer; Matthews, Brian; Knight, Nicola; Coles, Simon J.

doi:10.1186/s13321-021-00535-x

“…We have already pointed out that, leaving out the additional problem of extracting the chemical information from the images, a RNN only achieves a BLEU 4-gram score of 0.86 when translating from the SMILES to the IUPAC name (58). Nomenclature translation has been addressed with architectures based on the novel transformer networks (62), obtaining a practically perfect accuracy (63,64). Also, automatic recognition of molecular graphical depictions is able to correctly translate them to their SMILES representation with a 88% or 96% accuracy by using either a standard encoder-decoder (46) or a transformer (45) network.…”

Section: Discussion

confidence: 99%

Molecular Identification from AFM images using the IUPAC Nomenclature and Attribute Multimodal Recurrent Neural Networks

Carracedo-Cosme, Romero-Muñiz, Pou et al. 2022

Preprint


Despite being the main tool to visualize molecules at the atomic scale, Atomic Force Microscopy (AFM) with CO-functionalized metal tips is unable to chemically identify the observed molecules. Here we present a strategy to address this challenging task using deep learning techniques. Instead of identifying a finite number of molecules following a traditional classification approach, we define the molecular identification as an image captioning problem. We design an architecture, composed of two multimodal recurrent neural networks, capable of identifying the structure and composition of an unknown molecule using a 3D-AFM image stack as input. The neural network is trained to pro-1


“…From the perspective of life science, the properties of molecules and the effects of drugs are mostly determined by their 3D structures [14, 15]. In most current MRL methods, one starts with representing molecules as 1D sequential strings, such as SMILES [16,17,18] and InChI [19,20,21], or 2D graphs [22,11,23,12]. This may limit their ability to incorporate 3D information for downstream tasks.…”

Section: Introduction

confidence: 99%

Uni-Mol: A Universal 3D Molecular Representation Learning Framework

Zhou, Gao, Ding et al. 2022

Preprint


Molecular representation learning (MRL) has gained tremendous attention due to its critical role in learning from limited supervised data for applications like drug design. In most MRL methods, molecules are treated as 1D sequential tokens or 2D topology graphs, limiting their ability to incorporate 3D information for downstream tasks and, in particular, making it almost impossible for 3D geometry prediction or generation. Herein, we propose Uni-Mol, a universal MRL framework that significantly enlarges the representation ability and application scope of MRL schemes. Uni-Mol is composed of two models with the same SE(3)-equivariant transformer architecture: a molecular pretraining model trained by 209M molecular conformations; a pocket pretraining model trained by 3M candidate protein pocket data. The two models are used independently for separate tasks, and are combined when used in protein-ligand binding tasks. By properly incorporating 3D information, Uni-Mol outperforms SOTA in 14/15 molecular property prediction tasks. Moreover, Uni-Mol achieves superior performance in 3D spatial tasks, including protein-ligand binding pose prediction, molecular conformation generation, etc. Finally, we show that Uni-Mol can be successfully applied to the tasks with few-shot data like pocket druggability prediction. The model and data will be made publicly available at https://github.com/dptech-corp/Uni-Mol.


“…From the perspective of life science, the properties of molecules and the effects of drugs are mostly determined by their 3D structures [14, 15]. In most current MRL methods, one starts with representing molecules as 1D sequential strings, such as SMILES [16,17,18] and InChI [19,20,21], or 2D graphs [22,11,23,12,24]. This may limit their ability to incorporate 3D information for downstream tasks.…”

Section: Introduction

confidence: 99%

Uni-Mol: A Universal 3D Molecular Representation Learning Framework

Zhou, Gao, Ding et al. 2022

Preprint




Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier

Cited by 17 publications

References 29 publications
