By analyzing the scaffold content of the CAS Registry, we attempt to characterize in a comprehensive way the structural diversity of organic chemistry. The scaffold of a molecule is taken to be its framework, defined as all its ring systems and all the linkers that connect them. Framework data from more than 24 million organic compounds is analyzed. The distribution of frameworks among compounds is found to be top-heavy, i.e., a small percentage of frameworks occur in a large percentage of compounds. When frameworks are analyzed at the graph level, an even more top-heavy distribution is found: half of the compounds can be described by only 143 framework shapes. The most significant finding is that the framework distribution conforms almost exactly to a power law. This suggests that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. This may be explained by the cost of synthesis: making a new derivative of a framework is probably less costly if many other derivatives are known. We believe this power law is evidence that the minimization of synthetic cost has been a key factor in shaping the known universe of organic chemistry.
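The two quantities described above, how few frameworks cover half of the compounds and how closely the frequencies follow a power law, can be illustrated with a minimal sketch. It assumes each compound has already been reduced to a framework identifier (a plain string here). The function names, the toy data, and the rank-frequency log-log fit are illustrative assumptions, not the procedure used in the paper.

```python
import math
from collections import Counter


def frameworks_to_cover(framework_per_compound, fraction=0.5):
    """Number of most-frequent frameworks needed to cover `fraction` of compounds."""
    counts = Counter(framework_per_compound)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return n
    return len(counts)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    A roughly constant negative slope is one common, informal sign of
    power-law-like behaviour; it is not a rigorous statistical test.
    Requires at least two positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy data: one framework label per compound.
toy = ["benzene"] * 6 + ["pyridine"] * 2 + ["indole", "naphthalene"]
print(frameworks_to_cover(toy))          # frameworks needed for half the compounds
print(loglog_slope([60, 30, 20, 15]))    # frequencies proportional to 1/rank
```

In the toy data a single framework already accounts for more than half of the compounds, a small-scale version of the top-heavy distribution the abstract reports.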
A large set of organic compounds extracted from the CAS Registry is analyzed to study recent changes in structural diversity. The diversity is characterized using the framework content of the compounds; the framework of a molecule is the scaffold consisting of all its ring systems and all the chain fragments connecting them. The compounds are partitioned based on their year of first report in the literature, which allows framework occurrence frequencies to be compared across a 10-year interval. The results are consistent with a process in which frameworks with the greatest frequency of use in the past are the most likely to be used again, but it is also found that the frequency ordering changes over time. These fluctuations in ordering are attributed to stochastic factors, scientific and economic, that can affect how chemical space is explored. Framework diversity is found to have increased over time despite the extensive reuse of a relatively small number of frameworks; this increase is due to the large number of new frameworks. The long tail of the framework distribution, composed of frameworks that occur in few compounds or only one compound, is found to be a large and growing part of framework space.
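The partitioning by year of first report can be sketched as follows. This is only an illustration under stated assumptions: the records are hypothetical `(framework, year)` pairs, a single cutoff year separates the two periods, and the function names are invented here rather than taken from the paper.

```python
from collections import Counter


def split_by_year(records, cutoff_year):
    """Count framework occurrences before and from `cutoff_year` onward."""
    early = Counter(fw for fw, year in records if year < cutoff_year)
    late = Counter(fw for fw, year in records if year >= cutoff_year)
    return early, late


def new_frameworks(early, late):
    """Frameworks that appear in the later period but not in the earlier one."""
    return set(late) - set(early)


def reused_share(early, late):
    """Fraction of later-period compounds built on an earlier-period framework."""
    total = sum(late.values())
    if total == 0:
        return 0.0
    reused = sum(count for fw, count in late.items() if fw in early)
    return reused / total


# Hypothetical toy records: (framework label, year of first report).
records = [
    ("benzene", 1998), ("benzene", 1999), ("pyridine", 1999),
    ("benzene", 2008), ("indole", 2008), ("pyridine", 2009), ("azulene", 2009),
]
early, late = split_by_year(records, cutoff_year=2004)
print(sorted(new_frameworks(early, late)))
print(reused_share(early, late))
```

Comparing the size of `new_frameworks` with `reused_share` separates the two effects the abstract describes: heavy reuse of established frameworks and a growing long tail of new ones.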
A new method for organizing chemical rings based on their topology is presented. It uses three simple descriptors that characterize separate aspects of ring topology. These descriptors are integers and can thus be interpreted as the coordinates of discrete cells in a three-dimensional space. The descriptor values of any ring topology correspond to the coordinates of some cell. A database of rings can be distributed in this descriptor space by assigning each of them to the corresponding cell. This approach is applied to a database of 40 182 different ring topologies, derived from a comprehensive collection of chemical rings extracted from the CAS Registry File. This database is distributed among 7387 cells, and the population statistics and spatial distribution of these cells are discussed. An examination of selected cells shows that ring topologies which are similar tend to be close together in descriptor space. Some results of using this space to study ring diversity are presented. It is found that the distribution of the ring-topology database is not highly compact but has many significant voids. It is also found that the distribution of medicinally relevant rings in this space shows the influence of certain structural constraints on drug molecules.
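The idea of treating three integer descriptors as the coordinates of a cell can be sketched as below. The abstract does not name the descriptors, so the triples here are placeholders, and the ring identifiers, function names, and use of Manhattan distance as a notion of closeness are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_to_cells(rings):
    """Group ring identifiers by their (d1, d2, d3) integer descriptor triple."""
    cells = defaultdict(list)
    for ring_id, descriptors in rings.items():
        cells[tuple(descriptors)].append(ring_id)
    return dict(cells)


def cell_distance(cell_a, cell_b):
    """Manhattan distance between two cells in descriptor space."""
    return sum(abs(a - b) for a, b in zip(cell_a, cell_b))


# Hypothetical ring topologies with placeholder descriptor values.
rings = {
    "ring_A": (6, 1, 0),
    "ring_B": (6, 1, 0),
    "ring_C": (10, 2, 1),
    "ring_D": (9, 2, 1),
}
cells = assign_to_cells(rings)
print(len(cells))                              # number of occupied cells
print({cell: len(ids) for cell, ids in cells.items()})  # cell populations
print(cell_distance((10, 2, 1), (9, 2, 1)))
```

Because the coordinates are integers, occupied cells, their populations, and the empty regions between them can all be read directly from the dictionary keys.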
Measuring innovation in the pharmaceutical industry is challenging. Counts of new molecular entities (NMEs) approved by the Food and Drug Administration (FDA) are commonly used, but this measure gauges only quantity, not innovativeness. A new indicator of innovation for small-molecule and peptide drugs, based on structural novelty, is proposed and used to analyze recent trends in pharmaceutical innovation. We show that pharmaceutical innovation has increased significantly over the last several decades, despite recent concerns over an innovation crisis, and find that Pioneers (NMEs whose shape and scaffold were not used in any previously FDA-approved drug) are significantly more likely to be the source of promising new therapies. Analysis of the underlying source of structural innovation indicates that Pioneers tend to be based on scaffolds first reported in the CAS REGISTRY five or fewer years before their Investigational New Drug (IND) application, or on scaffolds populated with 50 or fewer other compounds at the time of IND. Our analysis also shows a widening structural innovation gap between large pharmaceutical companies (Big Pharma) and the rest of the ecosystem, even though the number of Pioneers originated by Big Pharma has increased.
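The Pioneer definition given above can be expressed as a short sketch. It assumes each drug is a hypothetical `(name, approval_year, shape, scaffold)` tuple with shape and scaffold already encoded as strings, and that drugs approved in the same year are processed in the order given. The function name and the toy data are illustrative, not drawn from the paper.

```python
def find_pioneers(drugs):
    """Return names of drugs whose shape and scaffold were both unseen
    in any earlier-approved drug, processing drugs in approval order."""
    seen_shapes = set()
    seen_scaffolds = set()
    pioneers = []
    # sorted() is stable, so same-year drugs keep their input order.
    for name, _year, shape, scaffold in sorted(drugs, key=lambda d: d[1]):
        if shape not in seen_shapes and scaffold not in seen_scaffolds:
            pioneers.append(name)
        seen_shapes.add(shape)
        seen_scaffolds.add(scaffold)
    return pioneers


# Hypothetical toy approvals: (name, approval year, shape id, scaffold id).
drugs = [
    ("drug_C", 2000, "shape_2", "scaffold_3"),
    ("drug_A", 1990, "shape_1", "scaffold_1"),
    ("drug_B", 1995, "shape_1", "scaffold_2"),
]
print(find_pioneers(drugs))
```

Here `drug_B` introduces a new scaffold on an already-used shape, so it is not counted as a Pioneer under this reading of the definition.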