Objective: This study aims to review novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles and to provide a quantitative analysis that answers questions about dataset contents, accessibility and citations. Methods: We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in the full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics about these 10 variables were reported for all extracted datasets. Results: We found that 28.5% of 12 324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, 128 unique dataset links were mentioned in these 12 324 articles. Further analysis showed that epidemiological datasets accounted for the largest portion (53.9%) of the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets. CSV, XLSX and JSON were the most popular data formats. Additionally, citation patterns varied considerably from one COVID-19 dataset to another. Conclusion: PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.
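The link-extraction and tallying steps described in the Methods above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the URL pattern, the function names (`extract_links`, `host_counts`, `share_with_links`) and the sample texts are all assumptions, and the manual review of each dataset is not modeled.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Loose URL pattern: an assumption for illustration, not the authors' extraction rule.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_links(text):
    """Return the unique URLs in `text`, in order of first appearance."""
    links = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # drop sentence punctuation glued to the URL
        if url not in links:
            links.append(url)
    return links


def host_counts(articles):
    """Count, per host, how many articles mention at least one link on that host."""
    counts = Counter()
    for text in articles:
        hosts = {urlparse(url).netloc.lower() for url in extract_links(text)}
        counts.update(hosts)
    return counts


def share_with_links(articles):
    """Fraction of articles that contain at least one link."""
    if not articles:
        return 0.0
    with_links = sum(1 for text in articles if extract_links(text))
    return with_links / len(articles)


# Hypothetical article snippets; the URLs are placeholders, not real datasets.
articles = [
    "Data are available at https://github.com/example/covid-data.",
    "See https://github.com/example/other and https://example.org/data.csv, as noted.",
    "No dataset was shared for this study.",
]
print(host_counts(articles).most_common())
print(round(share_with_links(articles), 3))
```

A real extraction would also need to separate dataset links from other URLs (publisher pages, software, references), which the study handled through manual review.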
Coronavirus disease of 2019 (COVID-19) has impacted the world in unprecedented ways since first emerging in December 2019. In the last two years, the scientific community has made an enormous effort to understand COVID-19 and potential interventions. As of June 15, 2021, there were more than 140,000 COVID-19-focused manuscripts on PubMed and preprint servers, such as medRxiv and bioRxiv. Preprints, which constitute more than 15% of all manuscripts, may contain more up-to-date research findings than published papers, because of the sometimes lengthy interval between manuscript submission and publication. Including preprints in systematic reviews and meta-analyses thus has the potential to improve the timeliness of reviews. However, there is no clear guideline on whether preprints should be included in systematic reviews and meta-analyses. Using a prototypical example of a rapid systematic review examining the comparative effectiveness of COVID-19 therapeutics, we propose including all preprints in the systematic review and assigning each a weight that we term the "confidence score". This proposal is motivated by our observation that, unlike the traditional journal submission process, which is unobserved, the timeline from posting to publication for a preprint can be observed and modeled as a time-to-event outcome. This observation provides a unique opportunity to model and quantify the probability that a preprint will be published, which can be used as a confidence score to weight preprints in systematic reviews and meta-analyses. To obtain the confidence score, we propose a novel survival cure model, which incorporates both the time from posting to publication for a preprint and key characteristics of the study described in the preprint. Using metadata from 158 preprints evaluating therapeutic options for COVID-19 posted through 09/03/2020, we demonstrate the utility of the confidence score in weighting preprints in a systematic review.
Our proposed method has the potential to advance timely systematic reviews of the evidence examining COVID-19 and other clinical conditions with rapidly evolving evidence bases by providing an approach for inclusion of unpublished manuscripts.
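One way to picture how a confidence score could enter a meta-analysis is the sketch below. The abstract does not say how the score is combined with the usual study weights, so the rule used here (multiplying each inverse-variance weight by the score) is an assumption, as are the function name `pooled_estimate` and the example numbers. The survival cure model that produces the scores is not implemented; the scores are simply taken as given.

```python
import math


def pooled_estimate(effects, variances, confidence):
    """Pool study effects with weights w_i = c_i / v_i.

    effects    -- per-study effect estimates (e.g. log odds ratios)
    variances  -- per-study sampling variances
    confidence -- per-study confidence scores in [0, 1]; published papers
                  would get 1.0 under the assumption made here

    Returns (estimate, standard_error). The weighting rule is an illustrative
    assumption, not the method described in the abstract.
    """
    if not (len(effects) == len(variances) == len(confidence)):
        raise ValueError("inputs must have the same length")
    weights = [c / v for c, v in zip(confidence, variances)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one study needs a positive weight")
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    # Variance of a weighted mean of independent estimates with fixed weights.
    variance = sum(w * w * v for w, v in zip(weights, variances)) / (total * total)
    return estimate, math.sqrt(variance)


# Hypothetical inputs: one published study and one preprint with score 0.5.
estimate, se = pooled_estimate([0.2, 0.6], [0.04, 0.04], [1.0, 0.5])
print(round(estimate, 4), round(se, 4))
```

When every score equals 1, the calculation reduces to ordinary fixed-effect inverse-variance pooling, so a preprint judged certain to be published is treated like a published paper.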
Objective: Representation learning in the context of biological concepts involves acquiring their numerical representations from various sources of biological information, such as sequences, interactions, and literature. This study presents a comprehensive systematic review that analyzes both quantitative and qualitative data to provide an overview of this field. Methods: Our systematic review involved searching for articles on the representation learning of biological concepts in the PubMed and EMBASE databases. Among the 507 articles published between 2015 and 2022, we carefully screened and selected 65 papers for inclusion. We then developed a structured workflow that involved identifying relevant biological concepts and data types, reviewing various representation learning techniques, and evaluating downstream applications for assessing the quality of the learned representations. Results: The primary focus of this review was on the development of numerical representations for gene/DNA/RNA entities. We found Word2Vec to be the most commonly used method for biological representation learning. Moreover, studies are increasingly using state-of-the-art large language models to learn numerical representations of biological concepts. We also observed that representations learned from specific sources were typically used for single downstream applications that were relevant to the source. Conclusion: Existing methods for biological representation learning are primarily focused on learning representations from a single data type, with the output being fed into predictive models for downstream applications. Although some studies have explored the use of multiple data types to improve the performance of learned representations, such research is still relatively scarce. In this systematic review, we have provided a summary of the data types, models, and downstream applications used in this task.