We developed a Transformer-based neural approach for translating between the SMILES and IUPAC chemical notations: Struct2IUPAC and IUPAC2Struct. The overall performance of our model is comparable to that of rule-based solutions. We demonstrated that the accuracy, computational speed, and robustness of the model allow it to be used in production. Our showcase demonstrates that a neural-based solution can facilitate rapid development while maintaining the required level of accuracy. We believe that our findings will inspire other developers to reduce development costs by replacing complex rule-based solutions with neural-based ones.
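As an illustration of the input side of such a translation model, a SMILES string is typically split into chemically meaningful tokens before it is fed to a Transformer. The sketch below is a minimal example under stated assumptions: the regular expression is a pattern commonly used for SMILES tokenization, and the function name `tokenize_smiles` is hypothetical. It is not taken from the Struct2IUPAC paper and may differ from the tokenizer the authors used.

```python
import re
from typing import List

# Assumption: a widely used regex for SMILES tokenization, not the paper's own tokenizer.
# Bracket atoms ([N+], [O-]) and two-letter halogens (Cl, Br) are kept as single tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: joining the tokens must reproduce the input exactly.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)Cl"))
    print(tokenize_smiles("c1ccccc1[N+](=O)[O-]"))
```

The round-trip check makes the tokenizer fail loudly on unsupported characters instead of silently dropping them, which matters when the token sequence is the only input the model sees.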
The rise of deep learning in various scientific and technological areas promotes the development of AI-based tools for information retrieval. Optical recognition of organic structures is a key part of the automated extraction of chemical information. However, this is a challenging task because of the large variety of representation styles. In this research, we present a Transformer-based artificial neural network that converts images of organic structures into molecular structures. To train the model, we created a comprehensive data generator that stochastically simulates various drawing styles, functional groups, functional group placeholders (R-groups), and visual contamination. We demonstrate that the Transformer-based architecture can gather chemical insights from our generator with almost absolute confidence. This means that, with a Transformer, one can fully concentrate on data simulation to build a good recognition model. A web demo of our optical recognition engine is available online on the Syntelly platform, and the code for dataset generation is available on GitHub.
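The idea of stochastically simulating drawing styles can be sketched as sampling a set of rendering parameters for each training image. The example below is only an illustration: the class `DrawingStyle`, the function `sample_style`, every field name, and all numeric ranges are hypothetical and are not taken from the authors' generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawingStyle:
    """Hypothetical rendering parameters for one synthetic structure image."""
    bond_width: float      # line thickness in pixels
    font_size: int         # atom label size in points
    rotation_deg: float    # rotation of the whole depiction
    noise_level: float     # fraction of pixels affected by visual contamination
    use_r_group: bool      # replace a substituent with an R-group placeholder


def sample_style(rng: random.Random) -> DrawingStyle:
    """Draw one random style; all ranges are illustrative assumptions."""
    return DrawingStyle(
        bond_width=rng.uniform(1.0, 3.0),
        font_size=rng.randint(10, 24),
        rotation_deg=rng.uniform(0.0, 360.0),
        noise_level=rng.uniform(0.0, 0.2),
        use_r_group=rng.random() < 0.3,
    )


if __name__ == "__main__":
    rng = random.Random(42)  # a fixed seed keeps the generated dataset reproducible
    for _ in range(3):
        print(sample_style(rng))
```

Passing an explicit `random.Random` instance rather than relying on the global generator keeps the sampling reproducible and lets several generator workers run with independent seeds.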