Research summary: To what extent do firms rely on basic science in their R&D efforts? Several scholars have sought to answer this and related questions, but progress has been impeded by the difficulty of matching unstructured references in patents to published papers. We introduce an open‐access dataset of references from the front pages of patents granted worldwide to scientific papers published since 1800. Each patent‐paper linkage is assigned a confidence score, which is characterized in a random sample by false negatives versus false positives. All matches are available for download at http://relianceonscience.org. We outline several avenues for strategy research enabled by these new data.

Managerial summary: To what extent do firms rely on basic science in their R&D efforts? Several scholars have sought to answer this and related questions, but progress has been impeded by the difficulty of matching unstructured references in patents to published papers. We introduce an open‐access dataset of references from the front pages of patents granted worldwide to scientific papers published since 1800. Each patent‐paper linkage is assigned a confidence score, and we check a random sample of these confidence scores by hand in order to estimate both coverage (i.e., of the matches we should have found, what percentage did we find) and accuracy (i.e., of the matches we found, what percentage are correct). We outline several avenues for strategy research enabled by these new data.
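The coverage and accuracy measures defined above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name, the identifiers, the scores, and the 0.5 and 0.3 cutoffs are all hypothetical, and it assumes the hand-checked sample lists every true match for the patents it covers.

```python
from typing import Dict, Set, Tuple

# A (patent_id, paper_id) pair. All identifiers below are made up for illustration.
Pair = Tuple[str, str]


def coverage_and_accuracy(
    scored_matches: Dict[Pair, float],
    known_good: Set[Pair],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, accuracy) for matches scoring at or above `threshold`.

    coverage: share of hand-verified matches that were found
    accuracy: share of found matches that are hand-verified as correct
    Assumes `known_good` is complete for the sampled patents.
    """
    found = {pair for pair, score in scored_matches.items() if score >= threshold}
    hits = found & known_good
    coverage = len(hits) / len(known_good) if known_good else 0.0
    accuracy = len(hits) / len(found) if found else 0.0
    return coverage, accuracy


# Hypothetical confidence scores for four candidate linkages.
scored = {
    ("P1", "A"): 0.9,
    ("P1", "B"): 0.4,
    ("P2", "C"): 0.8,
    ("P3", "D"): 0.7,  # a false positive: not in the hand-checked set
}
# Hypothetical hand-checked matches; ("P2", "E") was never found by the matcher.
truth = {("P1", "A"), ("P1", "B"), ("P2", "C"), ("P2", "E")}

for cutoff in (0.5, 0.3):
    cov, acc = coverage_and_accuracy(scored, truth, cutoff)
    print(f"threshold={cutoff}: coverage={cov:.2f}, accuracy={acc:.2f}")
```

Lowering the cutoff admits more linkages, so coverage can only rise or stay flat, while accuracy may move in either direction depending on how many of the newly admitted linkages are correct.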
We curate and characterize a complete set of citations from patents to scientific articles, including nearly 16 million from the full text of USPTO and EPO patents. Combining heuristics and machine learning, we achieve 25% higher performance than machine learning alone. At 99.4% accuracy we achieve coverage of 87.6%, and coverage exceeds 90% at accuracy above 93%. Performance is evaluated with a set of 5,939 randomly sampled, cross-verified "known good" citations, which the authors have never seen. We compare these "in-text" citations with the "official" citations on the front page of patents. In-text citations are more diverse temporally, geographically, and topically. They are less self-referential and less likely to be recycled from one patent to the next. That said, in-text citations have been overshadowed by front-page citations in the past few decades, dropping from 80% of all patent-to-article citations to less than 40%. In replicating two published articles that use only citations on the front page of patents, we show that failing to capture those in the body text leads to understating the relationship between academic science and commercial invention. All patent-to-article citations, as well as the known-good test set, are available at http://relianceonscience.org.
We curate and characterize a complete set of citations from patents to scientific articles, including 16.8 million from the full text of USPTO and EPO patents. Combining hand-tuned heuristics and the GROBID machine-learning package, we achieve much higher performance than machine learning alone. Recall is evaluated with a set of 5939 randomly sampled, cross-verified "known good" citations, which the authors have never seen. At 99.4% precision, we achieve recall rates of 78% for the full test set and 88% for references specified without mistakes. We compare these "in-text" citations with those on the front page of patents. In-text citations are more diverse temporally, geographically, and topically; moreover, they are less self-referential and less likely to be copied from one patent to the next. In-text citations have dropped from two-thirds of all patent-to-article citations half a century ago to about one-third today. In replicating two articles that use only front-page citations, we show that failing to capture in-text citations leads to understating the role of academic science in commercial invention. All patent-to-article citations, the known-good test set, and the source code are available at http://relianceonscience.org.