Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis

Sarker, Abeed; Chandrashekar, Pramod; Magge, Arjun; Cai, Haitao; Klein, Ari Z.; González, Graciela

doi:10.2196/jmir.8164

Cited by 56 publications

(65 citation statements)

References 28 publications


“…In recent work [10], we took the first step towards exploring whether social media mining could be used to complement pregnancy exposure registries as a novel method for observing pregnancies. Considering that 21% of American adults and, more specifically, 36% of Americans between ages 18–29 use Twitter [11], the promise of valuable information directly from the population of interest motivated us to develop and deploy a natural language processing (NLP) and machine learning pipeline that automatically collects and stores the Twitter user timelines—all publicly available posts over time by that user—of women who have reported a pregnancy on Twitter.…”

Section: Introduction (mentioning)

confidence: 99%

Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter

Klein, Sarker, Cai, et al. 2018

Journal of Biomedical Informatics

Self Cite


Background: Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited.

Objective: The primary objectives of this study were (i) to assess whether rare health-related events—in this case, birth defects—are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis.

Methods: To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically detected via their public announcements of pregnancies on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user’s child has a birth defect, and (ii) accessibility to the user’s tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter.

Results: We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user’s child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen’s kappa). Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4,169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95.

Conclusions: Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
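The recall figure in the abstract above can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `estimate_recall` and the tweet identifiers are hypothetical, and the sketch assumes the simplest reading of the abstract, namely that true positives found by alternative retrieval methods serve as the reference set and recall is the share of that set also captured by the main approach.

```python
def estimate_recall(retrieved_ids, reference_positive_ids):
    """Estimate recall against a reference set of known true positives.

    retrieved_ids: IDs of tweets returned by the approach under evaluation.
    reference_positive_ids: IDs of true positives found by alternative means
    (an assumption of this sketch, not a detail taken from the paper).
    Returns (recall, missed_ids), where missed_ids are the false negatives.
    """
    retrieved = set(retrieved_ids)
    reference = set(reference_positive_ids)
    if not reference:
        raise ValueError("reference set must not be empty")
    found = reference & retrieved   # true positives the approach captured
    missed = reference - retrieved  # false negatives (undetected tweets)
    return len(found) / len(reference), missed


# Hypothetical IDs for illustration only.
recall, missed = estimate_recall({"t1", "t2", "t3"}, {"t2", "t3", "t4", "t5"})
print(recall, sorted(missed))
```

Returning the missed IDs alongside the ratio reflects the abstract's point that undetected tweets are themselves useful, since they can be reviewed to extend the retrieval rules.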


“…We handcrafted 11 regular expressions to retrieve tweets that mention adverse pregnancy outcomes, from a database containing more than 400 million public tweets posted by more than 100,000 users who have announced their pregnancy on Twitter [7]. These query patterns were designed to account for the various ways adverse pregnancy outcomes may be linguistically expressed on social media—for example, reporting a miscarriage or stillbirth through the use of rainbow baby (Pattern 2) or hashtags such as #babyloss, #pregnancyloss, #iam1in4, or #waveoflight (Pattern 9), learned through an iterative process of manually reviewing tweets matched by other query patterns [8].…”

Section: Experimental Design, Materials and Methods (mentioning)

confidence: 99%
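To make the quoted description of query patterns concrete, here is a minimal sketch of regular-expression retrieval. The two patterns are hypothetical and only loosely modeled on the examples named in the quote ("rainbow baby" and the loss-related hashtags); the 11 patterns actually used are in the cited paper's supplementary material and are not reproduced here. The names `PATTERNS`, `matching_patterns`, and `retrieve` are likewise assumptions for illustration.

```python
import re

# Hypothetical patterns, loosely modeled on the examples in the quote above.
# They are NOT the paper's actual regular expressions.
PATTERNS = {
    "rainbow_baby": re.compile(r"\brainbow\s+bab(?:y|ies)\b", re.IGNORECASE),
    "loss_hashtags": re.compile(
        r"#(?:babyloss|pregnancyloss|iam1in4|waveoflight)\b", re.IGNORECASE
    ),
}


def matching_patterns(tweet):
    """Return the names of all patterns that match the tweet text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(tweet)]


def retrieve(tweets):
    """Keep tweets matched by at least one pattern, paired with pattern names."""
    results = []
    for tweet in tweets:
        names = matching_patterns(tweet)
        if names:
            results.append((tweet, names))
    return results


# Made-up example tweets for illustration only.
sample = [
    "Our rainbow baby is due in May #IAm1in4",
    "Saw a rainbow on the way home today",
]
for tweet, names in retrieve(sample):
    print(names, tweet)
```

Recording which pattern matched each tweet mirrors the per-pattern breakdown the quoted papers report, and makes it easier to review matches pattern by pattern when refining the rules.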

“…Data format: Raw, analyzed. Parameters for data collection: Tweets were collected if they mention miscarriage, stillbirth, preterm birth/premature labor, low birthweight, or neonatal intensive care. Description of data collection: Handcrafted regular expressions retrieved 22,912 tweets that mention adverse pregnancy outcomes from a database containing public tweets posted by women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled 8109 of the 22,912 tweets (one random tweet per user) in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome from those that merely mention the outcome.…”

Section: Specifications Table (mentioning)

confidence: 99%

“…The 6487 tweets were retrieved from a database [7] using 11 handcrafted regular expressions—search patterns that define matching text strings (Supplementary Material). Table 1 presents samples of (slightly modified) tweets in the data set, and total distribution of “outcome” and “non-outcome” tweets for each of the 11 query patterns.…”

Section: Data Description (mentioning)

confidence: 99%

“…To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome (“outcome” tweets) from those that merely mention the outcome (“non-outcome” tweets).…”

mentioning

confidence: 99%


An annotated data set for identifying women reporting adverse pregnancy outcomes on Twitter

Klein, Gonzalez-Hernandez 2020

Data in Brief

Self Cite


Despite the prevalence in the United States of miscarriage [1], stillbirth [2], and infant mortality associated with preterm birth and low birthweight [3], their causes remain largely unknown [4], [5], [6]. To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome (“outcome” tweets) from those that merely mention the outcome (“non-outcome” tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). The tweets annotated as “outcome” include 1318 women reporting miscarriage, 94 stillbirth, 591 preterm birth or premature labor, 171 low birthweight, 453 neonatal intensive care, and 356 fetal/infant loss in general. These “outcome” tweets can be used to explore patient experiences and perceptions of adverse pregnancy outcomes, and can direct researchers to the users’ broader timelines—tweets posted by a user over time—for observational studies. Our past work demonstrates the analysis of timelines for selecting a study population [8] and conducting a case-control study [9] of users reporting that their child has a birth defect. For larger-scale studies, the full annotated corpus can be used to train supervised machine learning algorithms to automatically identify additional users reporting adverse pregnancy outcomes on Twitter. We used the annotated corpus to train feature-engineered and deep learning-based classifiers presented in “A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes” [10].
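The inter-annotator agreement reported in the abstract above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows that standard calculation for two annotators; the function name `cohens_kappa` and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy binary labels (1 = "outcome", 0 = "non-outcome"); made up for illustration.
annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

In the toy example the annotators agree on 8 of 10 items, but because each labels 40% of items as "outcome", chance agreement is 0.52, so kappa is well below the raw 0.8 agreement rate.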


Machine Learning and Natural Language Processing for Geolocation-Centric Monitoring and Characterization of Opioid-Related Social Media Chatter

et al. 2019

Self Cite

