Molecular Structure Extraction from Documents Using Deep Learning

Staker, Joshua; Marshall, Kyle; Abel, Robert; McQuaw, Carolyn M.

doi:10.1021/acs.jcim.8b00669

Cited by 84 publications

(76 citation statements)

References 23 publications

“…Moreover, as the need for harvesting the large amounts of published data grows, the demand for methods for easily mining structures from papers and patent data is also growing. Optical Character Recognition (OCR) systems, relying on a variety of ML and probabilistic pattern recognition techniques, were created to translate 2D depictions of chemical structures to standard chemical representations [146–148]. Nonetheless, the development of OCR systems can be hindered by the images’ resolutions, the computational interpretations of chemical abbreviations, and the nature of the image representation, which can be embedded in text, in figures containing multiple structures, or in reaction pathways, and can be represented as either a skeletal formula or a Markush structure.…”

Section: AI Applications Within Drug Discovery Using Molecular Repres… (mentioning)

confidence: 99%

Molecular representations in AI-driven drug discovery: a review and practical guide

et al. 2020

The technological advances of the past century, marked by the computer revolution and the advent of high-throughput screening technologies in drug discovery, opened the path to the computational analysis and visualization of bioactive molecules. For this purpose, it became necessary to represent molecules in a syntax that would be readable by computers and understandable by scientists of various fields. A large number of chemical representations have been developed over the years, their numerosity being due to the fast development of computers and the complexity of producing a representation that encompasses all structural and chemical characteristics. We present here some of the most popular electronic molecular and macromolecular representations used in drug discovery, many of which are based on graph representations. Furthermore, we describe applications of these representations in AI-driven drug discovery. Our aim is to provide a brief guide on structural representations that are essential to the practice of AI in drug discovery. This review serves as a guide for researchers who have little experience with the handling of chemical representations and plan to work on applications at the interface of these fields.

“…In 2019, Staker et al. [41] presented a data-driven, deep learning based approach for OCSR called Molecular Structure Extraction from Documents Using Deep Learning (MSE-DUDL). The system uses two types of networks in the backend: a segmentation network and a structure prediction network.…”

Section: Machine-learning-based Systems (mentioning)

confidence: 99%

A review of optical chemical structure recognition tools

Rajan, Brinkhaus, Zielesny et al. 2020

J Cheminform

Structural information about chemical compounds is typically conveyed as 2D images of molecular structures in scientific documents. Unfortunately, these depictions are not a machine-readable representation of the molecules. With a backlog of decades of chemical literature in printed form not properly represented in open-access databases, there is a high demand for the translation of graphical molecular depictions into machine-readable formats. This translation process is known as Optical Chemical Structure Recognition (OCSR). Today, we are looking back on nearly three decades of development in this demanding research field. Most OCSR methods follow a rule-based approach where the key step of vectorization of the depiction is followed by the interpretation of vectors and nodes as bonds and atoms. Opposed to that, some of the latest approaches are based on deep neural networks (DNN). This review provides an overview of all methods and tools that have been published in the field of OCSR. Additionally, a small benchmark study was performed with the available open-source OCSR tools in order to examine their performance.

“…To the best of our knowledge, only two published methods were fully based on machine learning applied on the raw image data. MSE-DUDL [14] was published in 2019. It contains a segmentation network to extract molecule images from other components of the input page, coupled to a molecular recognition network.…”

Section: Related Work (mentioning)

confidence: 99%

Img2Mol - Accurate SMILES Recognition from Molecular Graphical Depictions

Clevert, Lê, Winter et al. 2021

Preprint

Automatic recognition of the molecular content of a molecule’s graphical depiction is an extremely challenging problem that remains largely unsolved despite decades of research. Recent advances in neural machine translation enable the auto-encoding of molecular structures in a continuous vector space of fixed size (latent representation) with low reconstruction errors. In this paper, we present a fast and accurate model combining a deep convolutional neural network learning from molecule depictions and a pre-trained decoder that translates the latent representation into the SMILES representation of the molecules. This combination allows to precisely infer a molecular structure from an image. Our rigorous evaluation show that Img2Mol is able to correctly translate up to 88% of the molecular depictions into their SMILES representation. A pretrained version of Img2Mol is made publicly available on GitHub for non-commercial users.
