This article contributes to the development of methods for analysing research funding systems by exploring the robustness and comparability of emerging approaches to generating funding landscapes useful for policy making. We use a novel data set of manually extracted and coded data on the funding acknowledgements of 7,510 publications representing UK cancer research in the year 2011 and compare these “reference data” with funding data provided by Web of Science (WoS) and MEDLINE/PubMed. Findings show high recall (around 93%) of WoS funding data. By contrast, MEDLINE/PubMed data retrieved less than half of the UK cancer publications acknowledging at least one funder. Conversely, both databases have high precision (above 90%): that is, few publications with no acknowledgement of funders are identified as having funding data. Nonetheless, funders acknowledged in UK cancer publications were not correctly listed by MEDLINE/PubMed and WoS in around 75% and 32% of cases, respectively. Reference data on the UK cancer research funding system are used as a case study to demonstrate the utility of funding data for strategic intelligence applications (e.g., mapping of the funding landscape and co-funding activity, comparison of funders' research portfolios).
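The recall and precision figures above come from comparing each database's funding flags against the manually coded reference data. The following is a minimal Python sketch of that comparison, not the study's code: the field names and the toy records are illustrative assumptions.

# Minimal sketch (not the study's code): recall and precision of a database's
# funding coverage against manually coded reference data.
def coverage_metrics(records):
    """records: list of dicts with two boolean flags (illustrative field names)
       'ref_funded' -- reference data say the paper acknowledges >= 1 funder
       'db_funded'  -- the database (WoS or MEDLINE/PubMed) lists funding data
    """
    tp = sum(r["ref_funded"] and r["db_funded"] for r in records)      # caught by the database
    fn = sum(r["ref_funded"] and not r["db_funded"] for r in records)  # missed by the database
    fp = sum(not r["ref_funded"] and r["db_funded"] for r in records)  # spurious funding data
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Toy usage with made-up records
sample = [
    {"ref_funded": True,  "db_funded": True},
    {"ref_funded": True,  "db_funded": False},
    {"ref_funded": False, "db_funded": False},
]
print(coverage_metrics(sample))  # -> (0.5, 1.0)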
Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.

Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords; in this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually but also their combinations through a voting scheme, and we tested their performance with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.

Findings: We found that the performance of the algorithms varies with the size of the training sample. However, for the classification exercise in this paper the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.

Research limitations: The dataset gathered has significantly more records related to the topic of interest than unrelated records. This may affect the performance of some algorithms, especially in their identification of unrelated papers.

Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With the help of accuracy, recall, […]

Originality/value: We analyzed the performance of seven algorithms and whether combinations of these algorithms improve accuracy in data collection. Use of these algorithms could reduce the time needed for manual data cleaning.
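The voting scheme described above can be illustrated with a short sketch. This is not the paper's implementation: it uses scikit-learn's LinearSVC and GradientBoostingClassifier as stand-ins for the SVM and Boosting classifiers, a placeholder corpus and labels, and an "agreed-upon predictions only" rule as one reading of the coverage indicator.

# Minimal sketch (assumptions: scikit-learn stand-ins, placeholder data), not the paper's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score

texts = ["relevant topic record text", "unrelated record text"] * 50   # placeholder corpus
labels = [1, 0] * 50                                                   # 1 = on-topic, 0 = off-topic
X = TfidfVectorizer().fit_transform(texts)

for train_frac in (0.1, 0.3, 0.5):                    # vary the size of the training set
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, train_size=train_frac, stratify=labels, random_state=0)

    svm = LinearSVC().fit(X_tr, y_tr)
    boost = GradientBoostingClassifier().fit(X_tr.toarray(), y_tr)

    p_svm = svm.predict(X_te)
    p_boost = boost.predict(X_te.toarray())
    agree = p_svm == p_boost                          # voting: keep labels both classifiers agree on
    coverage = agree.mean()                           # share of records with an agreed label
    y_agree = [y for y, a in zip(y_te, agree) if a]
    acc = accuracy_score(y_agree, p_svm[agree])
    rec = recall_score(y_agree, p_svm[agree], zero_division=0)
    print(f"train={train_frac:.0%}  coverage={coverage:.2f}  accuracy={acc:.2f}  recall={rec:.2f}")

Coverage here is the share of test records on which the two classifiers agree; accuracy and recall are then measured only on that agreed subset, mirroring the idea that disagreements are left for manual coding.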