A Review of: Clark, B. (2023). Proactive institutional repository collection development techniques: Archiving gold open access articles and metadata retrieved with web scraping. Journal of Library Administration, 63(6), 743–765. https://doi.org/10.1080/01930826.2023.2240190

Objective – To describe a method for collecting gold open access publications from the web and packaging them for batch deposit in an institutional repository. The goal of this project is to expand institutional repository holdings and make the collection more comprehensive by adding gold open access content.

Design – Web scraping and analysis of institutional repository usage metrics.

Setting – A library at a public doctoral university with very high research activity in Alabama, United States.

Subjects – Articles and metadata from the Multidisciplinary Digital Publishing Institute (MDPI) website and the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3) repository. MDPI is an open access publisher of over 400 journals spanning all disciplines. All articles published in MDPI journals are made freely and immediately accessible on the MDPI website. SCOAP3 is a global partnership of libraries, funding agencies, and research centers that support open access publishing in the field of high-energy physics. The SCOAP3 repository contains research funded by the organization and published in open access journals.

Methods – The MDPI website and SCOAP3 repository were selected because they contained a substantial amount of scholarship by University of Alabama affiliates. On the MDPI website, an author affiliation search across all journals retrieved University of Alabama publications. The Python library Beautiful Soup was used with the parser package lxml to collect articles and metadata. The first script iterated through the pages of search results, downloaded article PDFs, and wrote abstract page URLs to a text file.
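The URL-collection part of the first MDPI script can be approximated as follows. This is a minimal sketch under stated assumptions, not the author's code: it uses Python's standard-library `html.parser` in place of Beautiful Soup and lxml so that it runs without third-party packages, it parses an HTML string instead of fetching live search result pages, and it leaves out the PDF downloads. The `title-link` class name, the sample markup, the output file name, and all function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.mdpi.com"  # publisher site named in the review


class AbstractLinkParser(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, link_class):
        super().__init__()
        self.link_class = link_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Attribute values can be None when an attribute has no value.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.link_class in classes and href:
            self.hrefs.append(href)


def extract_abstract_urls(results_html, link_class="title-link"):
    """Return absolute abstract page URLs found in one results page.

    The default class name is a hypothetical placeholder; the real
    markup would need to be inspected before use.
    """
    parser = AbstractLinkParser(link_class)
    parser.feed(results_html)
    parser.close()
    seen = set()
    urls = []
    for href in parser.hrefs:
        url = urljoin(BASE_URL, href)
        if url not in seen:  # keep first occurrence, preserve page order
            seen.add(url)
            urls.append(url)
    return urls


def write_url_list(urls, path):
    """Write one URL per line, as input for a later metadata script."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")


# Made-up markup standing in for one page of search results.
SAMPLE_RESULTS_PAGE = """
<div class="article-item"><a class="title-link" href="/0000-0000/1/1/1">First article</a></div>
<div class="article-item"><a class="title-link" href="/0000-0000/1/1/2">Second article</a></div>
<a class="nav" href="/about">About</a>
"""

if __name__ == "__main__":
    found = extract_abstract_urls(SAMPLE_RESULTS_PAGE)
    write_url_list(found, "abstract_urls.txt")
    print(found)
```

In a real run, the same function would be called once per page of search results, with the collected URLs appended to a single text file.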
The second script collected metadata by iterating through the text file of abstract page URLs, parsing the HTML of each URL, and writing Dublin Core metadata to a CSV file. Articles already archived in the institutional repository were removed from the CSV file, and the remaining metadata were reviewed for errors. To pair each PDF with the correct metadata, the file names of all PDFs were added to the CSV file. Article PDFs and the metadata file were packaged using the DSpace CSV Archive and batch deposited in the University of Alabama’s institutional repository.

In SCOAP3, an author affiliation search retrieved University of Alabama publications. The browser automation software Selenium was used to collect articles and metadata. The first script iterated through the pages of search results and wrote article record page URLs to a text file. The second script downloaded article PDFs and extracted DOIs to use for PDF file names. The third script collected metadata by using the article record page URLs to query the SCOAP3 metadata harvesting API and writing MARCXML metadata to a CSV file. To pair each PDF with the correct metadata, the DOI column in the CSV file was duplicated, and the “.pdf” extension was added to each DOI in the copied column. The metadata in the CSV file were reviewed for errors, and citations and keywords were added manually. Articles and the metadata file were packaged and deposited in the same way as the MDPI content. The impact of SCOAP3 content on institutional repository downloads from the physics and astronomy collection was measured in the 100 days preceding and following the deposits.

Main Results – A total of 1,005 articles with corresponding metadata were collected from the MDPI website and SCOAP3 repository. After removing duplicate articles that were already archived in the University of Alabama institutional repository, 937 articles (272 from MDPI, 665 from SCOAP3) were deposited.
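Two of the CSV steps described in the Methods, dropping records that are already archived and deriving a PDF file name from each DOI, can be illustrated with the standard-library `csv` module. This is only a sketch under assumptions the review does not confirm: duplicates are matched on DOI, the `/` in a DOI is replaced with `_` so the result is a valid file name, and the column names, function names, and sample DOIs are all made up for the example.

```python
import csv
import io


def doi_to_filename(doi):
    """Derive a PDF file name from a DOI.

    Assumption: "/" cannot appear in a file name, so it is replaced
    with "_". The review does not say how this was handled.
    """
    return doi.strip().replace("/", "_") + ".pdf"


def prepare_rows(rows, archived_dois, doi_column="doi"):
    """Drop rows whose DOI is already archived and add a file name column.

    Matching on DOI (case-insensitively) is an assumption of this sketch.
    """
    archived = {d.strip().lower() for d in archived_dois}
    kept = []
    for row in rows:
        doi = row[doi_column].strip()
        if doi.lower() in archived:
            continue  # already held in the repository
        new_row = dict(row)
        new_row["filename"] = doi_to_filename(doi)
        kept.append(new_row)
    return kept


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up metadata; these DOIs are placeholders, not real records.
SAMPLE_CSV = (
    "doi,title\n"
    "10.0000/example.1,First article\n"
    "10.0000/example.2,Second article\n"
)

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    kept = prepare_rows(rows, archived_dois=["10.0000/EXAMPLE.2"])
    print(rows_to_csv(kept, ["doi", "title", "filename"]))
```

Keeping the file name in its own column, next to the unchanged DOI, is what lets a batch import match each metadata row to its PDF.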
The amount of faculty research available in the institutional repository increased from 1,639 articles before the project to 2,513 articles, or 37.3%. 678 articles were added to the physics and astronomy collection, which reflects the fact that most of the deposited articles were from a subject repository. The rest of the deposited articles were from MDPI and spanned various disciplines. The next best represented collections were civil, construction, and environmental engineering (26 articles); biological sciences (26 articles); electrical and computer engineering (24 articles); and geography (22 articles). The SCOAP3 articles also contributed to a significant increase in downloads from the physics and astronomy collection. Total downloads increased from 5,765 in the 100 days preceding the deposits to 7,243 in the 100 days following the deposits, with SCOAP3 articles representing 3,421 downloads, or 47.2%.

Conclusion – This project was successful in proactively increasing the amount of scholarship in the institutional repository without faculty or researcher participation. This semi-automated workflow requires considerable technical skills but is manageable for one person. Since the articles and metadata were freely accessible and issued under permissive Creative Commons licenses, there was no need to consult publisher self-archiving policies or solicit permission to copy the articles to the institutional repository. This project did not make any research openly accessible that was otherwise unavailable or behind a paywall, but the added publications contribute to making the institution’s scholarly record more complete. This approach may be particularly helpful for academic library staff looking to build the holdings of a brand-new institutional repository, or for those dealing with an underpopulated institutional repository due to low self-archiving rates.
Additional repositories containing a substantial amount of University of Alabama scholarship will be identified and considered for web scraping, to continue expanding the institutional repository holdings. The MDPI website and SCOAP3 repository will also be re-scraped in the future for research added since this project.