A bias in health research toward understanding diseases as they present in men can have a grave impact on the health of women. This paper reports on a conceptual review of the literature that used machine learning or NLP techniques to interrogate big data in order to identify sex-specific health disparities. We searched Ovid MEDLINE, Embase, and PsycINFO in October 2021 using synonyms and indexing terms for (1) "women" or "men" or "sex," (2) "big data" or "artificial intelligence" or "NLP," and (3) "disparities" or "differences." Of 902 records, 22 studies met the inclusion criteria and were analyzed. The results show that reporting of inclusion by sex is inconsistent and often absent, although in the included studies men are disproportionately underrepresented relative to women. Even though AI and NLP techniques are widely applied in health research, few studies use them to take advantage of unstructured text to investigate sex-related differences or disparities. Researchers are increasingly aware of sex-based data bias, but progress toward correcting it is slow. We reflect on best practices for using big data analytics to address sex-specific biases in understanding the etiology, diagnosis, and prognosis of diseases.
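The three-concept search described above can be sketched as a Boolean query in which the terms within each concept are joined with OR and the concepts are joined with AND. This is only a minimal illustration: the `build_query` helper is hypothetical, the term lists contain just the example terms quoted in the abstract (the full synonym and indexing-term lists are not given), and the output uses generic Boolean syntax rather than the field syntax of Ovid MEDLINE, Embase, or PsycINFO.

```python
def build_query(concept_groups):
    """OR the terms within each concept group, then AND the groups together."""
    clauses = []
    for terms in concept_groups:
        quoted = [f'"{term}"' for term in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Example terms only, taken from the abstract; the real search used
# additional synonyms and database indexing terms that are not listed there.
concept_groups = [
    ["women", "men", "sex"],
    ["big data", "artificial intelligence", "NLP"],
    ["disparities", "differences"],
]

query = build_query(concept_groups)
print(query)
```

A record matches only if it hits at least one term from every concept group, which is why widening the synonym list in any single group broadens the result set while adding a further group narrows it.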
Background: Since the onset of the COVID-19 pandemic, there has been an unprecedented effort in genomic epidemiology to sequence the SARS-CoV-2 virus and examine its molecular evolution. This has been facilitated by the availability of publicly accessible databases, GISAID and GenBank, which collectively hold millions of SARS-CoV-2 sequence records. However, genomic epidemiology seeks to go beyond phylogenetic analysis by linking genetic information to patient demographics and disease outcomes, enabling a comprehensive understanding of transmission dynamics and disease impact. While these repositories include some patient-related information, such as the location of the infected host, the granularity of this data and the inclusion of demographic and clinical details are inconsistent. Additionally, the extent to which patient-related metadata is reported in published sequencing studies remains largely unexplored. Therefore, it is essential to assess the extent and quality of patient-related metadata reported in SARS-CoV-2 sequencing studies. Moreover, there is limited linkage between published articles and sequence repositories, hindering the identification of relevant studies. Traditional search strategies based on keywords may miss relevant articles. To overcome these challenges, this study proposes the use of an automated classifier to identify relevant articles.
Objective: This study aims to conduct a systematic and comprehensive scoping review, along with a bibliometric analysis, to assess the reporting of patient-related metadata in SARS-CoV-2 sequencing studies.
Methods: The NIH's LitCovid collection will be used for the machine learning classification, while an independent search will be conducted in PubMed. Data extraction will be conducted using Covidence, and the extracted data will be synthesized and summarized to quantify the availability of patient metadata in the published literature of SARS-CoV-2 sequencing studies. For the bibliometric analysis, relevant data points, such as author affiliations, journal information, and citation metrics, will be extracted.
Results: The study will report findings on the extent and types of patient-related metadata reported in genomic viral sequencing studies of SARS-CoV-2. The scoping review will identify gaps in the reporting of patient metadata and make recommendations for improving the quality and consistency of reporting in this area. The bibliometric analysis will uncover trends and patterns in the reporting of patient-related metadata, such as differences in reporting based on study types or geographic regions. Co-occurrence networks of author keywords will also be presented to highlight frequent themes and their associations with patient metadata reporting.
Conclusion: This study will contribute to advancing knowledge in the field of genomic epidemiology by providing a comprehensive overview of the reporting of patient-related metadata in SARS-CoV-2 sequencing studies. The insights gained from this study may help improve the quality and consistency of reporting patient metadata, enhancing the utility of sequence metadata and facilitating future research on infectious diseases. The findings may also inform the development of machine learning methods to automatically extract patient-related information from sequencing studies.