This work aims to identify classes of DOI mistakes by analysing the open bibliographic metadata available in Crossref, highlighting which publishers were responsible for such mistakes and how many of these incorrect DOIs could be corrected through automatic processes. Using a list of invalid cited DOIs gathered by OpenCitations while processing the OpenCitations Index of Crossref open DOI-to-DOI citations (COCI) in the past two years, we retrieved the citations to such invalid DOIs from the January 2021 Crossref dump. We processed these citations by keeping track of their validity and of the publishers responsible for uploading the related citation data to Crossref. Finally, we identified patterns of factual errors in the invalid DOIs and the regular expressions needed to catch and correct them. The outcomes of this research show that only a few publishers were responsible for and/or affected by the majority of invalid citations. We extended the taxonomy of DOI name errors proposed in past studies and defined more elaborate regular expressions that can clean a higher number of mistakes in invalid DOIs than prior approaches. The data gathered in our study can enable a qualitative investigation of the possible reasons for DOI mistakes, helping publishers identify the problems underlying their production of invalid citation data. In addition, the DOI cleaning mechanism we present could be integrated into existing processes (e.g. in COCI) to add citations by automatically correcting a wrong DOI. This study was run strictly following Open Science principles and, as such, our research outcomes are fully reproducible.
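To make the idea of regex-based DOI cleaning concrete, here is a minimal sketch in Python. The three patterns below are illustrative assumptions, not the regular expressions defined in the study: they strip a resolver or `doi:` prefix, drop trailing punctuation, and check that what remains has the general shape of a DOI. The function name `clean_doi` is hypothetical.

```python
import re
from typing import Optional

# Illustrative patterns only (assumptions, not the study's actual rules).
PREFIX_NOISE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
SUFFIX_NOISE = re.compile(r"[.,;:)\]]+$")
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def clean_doi(raw: str) -> Optional[str]:
    """Return a normalised DOI candidate, or None if nothing DOI-like remains."""
    candidate = raw.strip()
    candidate = PREFIX_NOISE.sub("", candidate)   # drop resolver URL or "doi:" label
    candidate = SUFFIX_NOISE.sub("", candidate)   # drop trailing punctuation
    if DOI_SHAPE.match(candidate):
        return candidate.lower()                  # DOIs are case-insensitive
    return None


if __name__ == "__main__":
    for raw in ["https://doi.org/10.1000/ABC123.", "doi: 10.1234/xyz)", "not a doi"]:
        print(repr(raw), "->", clean_doi(raw))
```

A cleaned candidate would still need to be resolved before being trusted, since some legitimate DOIs end in characters such as `)` that a suffix rule like this one removes.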
The purpose of this research is to identify the publishers that caused citations to be missing from COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations) by sending incorrect metadata to Crossref, the publishers to which such invalid citations point, and the number of previously invalid citations that are now valid.

Study design/methodology
To find the invalid citations, we use an already generated CSV file, available online, containing the DOIs of invalid citations and their correct form. These DOIs, together with the COCI REST API, can lead us to the responsible and referenced publishers.

Findings
We found for each individual publisher (1) the number of citations sent with incorrect metadata and (2) the number of invalid citations received. We also extracted the total number of invalid citations that have since been corrected.

Originality/value
The results of this research may point us to publishers who generally send out incorrect citation metadata and, inversely, those who generally receive invalid citations. These findings can first of all raise awareness of how accurately (or not) certain publishing houses manage their metadata. Moreover, finding these trends and showcasing the labor of the corrections may lead to increasingly valid citations if the proper measures are taken.

Research limitations/implications
Based on the available data for COCI, there may be a slight bias in our sample, causing some publishers to be incorrectly represented.
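The per-publisher figures described above can be tallied with a few counters. The sketch below is only an illustration under stated assumptions: it supposes that each invalid citation has already been resolved to a triple of citing publisher, cited publisher, and a flag saying whether the cited DOI is now valid. That record layout and the function name `tally_invalid_citations` are hypothetical and do not come from the study.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical record layout: (citing_publisher, cited_publisher, is_now_valid)
Record = Tuple[str, str, bool]


def tally_invalid_citations(records: Iterable[Record]):
    """Count invalid citations sent and received per publisher, plus those now valid."""
    sent = Counter()       # publisher of the citing article
    received = Counter()   # publisher of the cited article
    now_valid = 0
    for citing_publisher, cited_publisher, is_now_valid in records:
        sent[citing_publisher] += 1
        received[cited_publisher] += 1
        if is_now_valid:
            now_valid += 1
    return sent, received, now_valid


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("Publisher A", "Publisher B", False),
        ("Publisher A", "Publisher C", True),
        ("Publisher D", "Publisher B", False),
    ]
    sent, received, now_valid = tally_invalid_citations(sample)
    print(sent.most_common(), received.most_common(), now_valid)
```

Here every citation is counted for both publishers regardless of whether it has since become valid; a different analysis might prefer to count only those still invalid.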
A preliminary note
This protocol illustrates the workflow adopted in a research project that operates within the OpenCitations environment. OpenCitations is an independent infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies. COCI is the OpenCitations Index of Crossref open DOI-to-DOI citations.

Purpose
The purpose of this research is to identify the publishers that caused citations to be missing from COCI by sending incorrect metadata to Crossref, the publishers to which such invalid citations point, and the number of previously invalid citations that are now valid. The ultimate aim is to contribute to resolving this type of problem, so that the citations that are now valid can be added to COCI and those still invalid can be corrected, increasing the number of open citations available and indexed in the OpenCitations project.

Study design/methodology
We start from an already generated CSV file containing the valid citing DOIs and the invalid cited DOIs, which is available from Peroni, S. (2021). Citations to invalid DOI-identified entities obtained from processing DOI-to-DOI citations to add in COCI (1.0). Zenodo. https://doi.org/10.5281/ZENODO.4625300. These citations to invalid DOIs were retrieved while processing Crossref data to add open citations to COCI, but they were not added to COCI because they point to a non-resolvable cited document. Two REST API services can help: the DOI REST API, to check whether an invalid cited DOI is now valid, and the Crossref REST API, to retrieve the publisher from the prefix of the DOI, for both the cited and the citing publications.
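The two lookups described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the protocol's actual code. It assumes the DOI proxy endpoint `https://doi.org/api/handles/<doi>`, whose JSON reply carries a `responseCode` equal to 1 when the DOI resolves, and the Crossref endpoint `https://api.crossref.org/prefixes/<prefix>`, whose JSON reply carries the publisher name under `message.name`. All function names are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

DOI_API = "https://doi.org/api/handles/"
CROSSREF_PREFIX_API = "https://api.crossref.org/prefixes/"


def doi_prefix(doi: str) -> str:
    """Return the part of the DOI before the first slash, e.g. '10.5281'."""
    return doi.split("/", 1)[0]


def handle_is_resolvable(payload: dict) -> bool:
    """Interpret a DOI proxy reply: responseCode 1 is assumed to mean success."""
    return payload.get("responseCode") == 1


def publisher_name(payload: dict):
    """Extract the publisher name from a Crossref prefix reply, if present."""
    return payload.get("message", {}).get("name")


def fetch_json(url: str) -> dict:
    """GET a URL and decode its JSON body; error replies without JSON yield {}."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        try:
            return json.load(error)   # some error replies still carry JSON
        except ValueError:
            return {}


def is_doi_valid(doi: str) -> bool:
    return handle_is_resolvable(fetch_json(DOI_API + urllib.parse.quote(doi)))


def publisher_for(doi: str):
    return publisher_name(fetch_json(CROSSREF_PREFIX_API + doi_prefix(doi)))


if __name__ == "__main__":
    doi = "10.5281/zenodo.4625300"   # the dataset cited in the text
    print(is_doi_valid(doi), publisher_for(doi))
```

Keeping the response parsing separate from the network calls makes the logic easy to check against saved replies, and a real run over many DOIs would also need rate limiting and retries.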
Findings
Collecting the names of the publishers involved in these missing citations, either as the publisher of the citing article or as the publisher of the cited article, was sufficient to answer our research questions. In addition, we decided to collect further information that can help us get a better picture of the situation. As for the JSON file, we found for each individual publisher (1) the number of citations sent with incorrect metadata and (2) the number of invalid citations received. As required by the initial research questions, we also extracted the total number of invalid citations that have since been corrected.

Originality/value
The results of this research may point us to publishers who generally send out incorrect citation metadata and, inversely, those who generally receive invalid citations. These findings can first of all raise awareness of how accurately (or not) certain publishing houses manage their metadata. Moreover, finding these trends and showcasing the labor of the corrections may lead to increasingly valid citations if the proper measures are taken.

Research limitations/im...