Background: While scientific knowledge of post–COVID-19 condition (PCC) is growing, there remains significant uncertainty in the definition of the disease, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes as the content is produced at high resolution by patients and caregivers, representing experiences that may be unavailable to most clinicians.

Objective: In this study, we aimed to determine the validity and effectiveness of advanced natural language processing approaches built to derive insight into PCC-related patient-reported health outcomes from the social media platforms Twitter and Reddit. We extracted PCC-related terms, including symptoms and conditions, and measured their occurrence frequency. We compared the outputs with human annotations and clinical outcomes and tracked symptom and condition term occurrences over time and locations to explore the pipeline’s potential as a surveillance tool.

Methods: We used bidirectional encoder representations from transformers (BERT) models to extract and normalize PCC symptom and condition terms from English posts on Twitter and Reddit. We compared 2 named entity recognition models and implemented a 2-step normalization task to map extracted terms to unique concepts in standardized terminology. The normalization steps were done using a semantic search approach with BERT biencoders. We evaluated the effectiveness of BERT models in extracting the terms using a human-annotated corpus and a proximity-based score. We also compared the validity and reliability of the extracted and normalized terms to a web-based survey with more than 3000 participants from several countries.

Results: UmlsBERT-Clinical had the highest accuracy in predicting entities closest to those extracted by human annotators. Based on our findings, the top 3 most commonly occurring groups of PCC symptom and condition terms were systemic (such as fatigue), neuropsychiatric (such as anxiety and brain fog), and respiratory (such as shortness of breath). We also found novel symptom and condition terms that had not been categorized in previous studies, such as infection and pain. Regarding the co-occurring symptoms, the pair of fatigue and headaches was among the most frequently co-occurring term pairs across both platforms. Based on the temporal analysis, the neuropsychiatric terms were the most prevalent, followed by the systemic category, on both social media platforms. Our spatial analysis showed that 42% (10,938/26,247) of the analyzed terms included location information, with the majority coming from the United States, United Kingdom, and Canada.

Conclusions: The outcome of our social media–derived pipeline is comparable with the results of peer-reviewed articles relevant to PCC symptoms. Overall, this study provides unique insights into patient-reported health outcomes of PCC and valuable information about the patient’s journey that can help health care providers anticipate future needs.

International Registered Report Identifier (IRRID): RR2-10.1101/2022.12.14.22283419
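
The semantic search normalization described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the concept names, and the function names `cosine` and `normalize_term` are hypothetical stand-ins. A real pipeline would obtain the embeddings from a BERT biencoder and search a full standardized terminology.

```python
import math

# Hypothetical toy embeddings for a few target concepts (assumption:
# in practice these vectors would come from a BERT biencoder).
CONCEPT_VECTORS = {
    "Fatigue": [0.9, 0.1, 0.0],
    "Shortness of breath": [0.1, 0.9, 0.1],
    "Anxiety": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize_term(term_vector, concepts=CONCEPT_VECTORS):
    """Return the concept whose embedding is most similar to the term's."""
    return max(concepts, key=lambda name: cosine(term_vector, concepts[name]))


# Hypothetical embedding of an extracted mention such as "so tired all the time".
mention_vector = [0.8, 0.2, 0.1]
best_concept = normalize_term(mention_vector)
print(best_concept)
```

Under these toy values the mention maps to "Fatigue", because its vector points in nearly the same direction as that concept's vector.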
Background: There remains significant uncertainty in the definition of long COVID, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes as the content is produced at high resolution by patients and caregivers, representing experiences that may be unavailable to most clinicians.

Objective: We aim to determine the validity and effectiveness of advanced NLP approaches built to derive insight into long COVID-related patient-reported health outcomes from social media platforms.

Methodology: We use Transformer-based BERT models to extract and normalize long COVID Symptoms and Conditions (SyCo) from English posts on Twitter and Reddit. Furthermore, we estimate the occurrence and co-occurrence of SyCo terms at any point or across time and locations. Finally, we compare the extracted health outcomes with human annotations and highly utilized clinical outcomes grounded in the medical literature.

Results: Based on our findings, the top three most commonly occurring groups of long COVID symptoms are systemic (such as "Fatigue"), neuropsychiatric (such as "Anxiety" and "Brain fog"), and respiratory (such as "Shortness of breath"). Regarding the co-occurring symptoms, the pair of "Fatigue & Headaches" is most common. In addition, we show that other conditions, such as infection, hair loss, and weight loss, as well as mentions of other diseases, such as flu, cancer, or Lyme disease, are among the top reported terms by social media users.

Conclusion: The outcome of our social media-derived pipeline is comparable with the outcomes of peer-reviewed articles relevant to long COVID symptoms. Overall, this study provides unique insights into patient-reported health outcomes from long COVID and valuable information about the patient's journey that can help healthcare providers anticipate future needs.
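
The occurrence and co-occurrence estimates mentioned in the Methodology can be sketched as simple counting over the normalized terms found in each post. This is a minimal illustration under stated assumptions, not the authors' code: the function name `count_terms` and the example posts are invented, and each term is counted at most once per post.

```python
from collections import Counter
from itertools import combinations


def count_terms(posts):
    """Count term occurrences and unordered term-pair co-occurrences.

    `posts` is a list of lists, one list of normalized terms per post.
    A term is counted once per post even if it is mentioned repeatedly.
    """
    occurrence = Counter()
    cooccurrence = Counter()
    for terms in posts:
        unique_terms = sorted(set(terms))  # sorting keeps pair keys consistent
        occurrence.update(unique_terms)
        cooccurrence.update(combinations(unique_terms, 2))
    return occurrence, cooccurrence


# Invented example posts, already reduced to normalized terms.
posts = [
    ["fatigue", "headache"],
    ["fatigue", "headache", "anxiety"],
    ["brain fog", "fatigue"],
]

occurrence, cooccurrence = count_terms(posts)
print(occurrence.most_common(1))
print(cooccurrence.most_common(1))
```

Grouping the posts by month or by location before calling `count_terms` would give the per-period or per-region counts that a temporal or spatial analysis needs.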