Background
As Twitter has become an active data source for health surveillance research, it is important to develop efficient and effective methods for identifying tweets related to personal health experience. Conventional classification algorithms rely on features engineered by human domain experts, and engineering such features is challenging and requires considerable human effort. The resulting features may not be optimal for the classification problem, and the many ways in which personal experience can be expressed or described in tweets make it difficult for conventional classifiers to correctly predict personal experience tweets (PETs). In this study, we developed a method that combines word embedding with a long short-term memory (LSTM) model and does not require engineering any specific features. Through word embedding, tweet texts were represented as dense vectors, which in turn were fed to the LSTM neural network as sequences.

Results
Statistical analyses of the results of 10-fold cross-validation of our method and conventional methods indicate significant differences (p < 0.01) in the performance measures of accuracy, precision, recall, F1-score, and ROC/AUC, demonstrating that our approach outperforms the conventional methods in identifying PETs.

Conclusion
We presented an efficient and effective method of identifying health-related personal experience tweets by combining word embedding and an LSTM neural network. It is conceivable that our method can help accelerate and scale up the analysis of social media text for health surveillance purposes, because it removes the need for the laborious and costly process of engineering features.
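The pipeline described in this abstract (tokens mapped to dense vectors, then read as a sequence by an LSTM) can be illustrated with a minimal, untrained forward pass. This is only a sketch under stated assumptions: the class name `TinyLSTMClassifier`, the dimensions, the random weights, and the example tweet are all hypothetical and are not taken from the paper, which would use learned embeddings and a trained network.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMClassifier:
    """Hypothetical sketch: embedding lookup -> one LSTM layer -> sigmoid.

    Weights are random (untrained); this shows data flow only.
    """

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, vocab, embed_dim=8, hidden_dim=6, seed=0):
        rng = random.Random(seed)
        self.index = {word: i for i, word in enumerate(vocab)}
        self.hidden_dim = hidden_dim
        # One dense vector per word; the extra last row is for unknown words.
        self.embeddings = rand_matrix(len(vocab) + 1, embed_dim, rng)
        # Each gate has a weight matrix applied to [x_t ; h_{t-1}] plus a bias.
        self.W = {g: rand_matrix(hidden_dim, embed_dim + hidden_dim, rng)
                  for g in self.GATES}
        self.b = {g: [0.0] * hidden_dim for g in self.GATES}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]

    def embed(self, tokens):
        """Represent each token as a dense vector (no hand-made features)."""
        unknown = len(self.index)
        return [self.embeddings[self.index.get(t, unknown)] for t in tokens]

    def predict_proba(self, tokens):
        """Run the LSTM over the token sequence and score the final state."""
        h = [0.0] * self.hidden_dim  # hidden state
        c = [0.0] * self.hidden_dim  # cell state
        for x in self.embed(tokens):
            z = x + h  # list concatenation: [x_t ; h_{t-1}]
            pre = {g: [a + b for a, b in zip(matvec(self.W[g], z), self.b[g])]
                   for g in self.GATES}
            i_gate = [sigmoid(v) for v in pre["input"]]
            f_gate = [sigmoid(v) for v in pre["forget"]]
            o_gate = [sigmoid(v) for v in pre["output"]]
            cand = [math.tanh(v) for v in pre["candidate"]]
            c = [f * cv + i * g for f, cv, i, g in zip(f_gate, c, i_gate, cand)]
            h = [o * math.tanh(cv) for o, cv in zip(o_gate, c)]
        return sigmoid(sum(w * hv for w, hv in zip(self.w_out, h)))


# Made-up vocabulary and tweet, for illustration only.
vocab = ["i", "took", "melatonin", "and", "slept", "well"]
tokens = "i took melatonin and slept well".split()
model = TinyLSTMClassifier(vocab)
p = model.predict_proba(tokens)  # score in (0, 1); meaningless until trained
```

In practice the embedding table and gate weights would be learned from annotated tweets, and a deep learning library would replace the hand-written loops; the point here is only that the classifier consumes the raw token sequence rather than expert-designed features.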
Studies have shown that Twitter can be used for health surveillance, and personal experience tweets (PETs) are an important source of information for health surveillance. Mining Twitter data requires a relatively balanced corpus, and constructing such a corpus is challenging because annotating large data sets is labor-intensive. We developed a bootstrap method of finding PETs using a machine learning-based filter. Through a few iterations, our approach can efficiently improve the balance of a two-class dataset with a reduced amount of annotation work. To demonstrate the usefulness of our method, a PET corpus related to effects caused by 4 dietary supplements was constructed. In 3 iterations, a corpus of 8,770 tweets was obtained from 108,528 collected tweets, and the imbalance between the two classes was significantly reduced from 1:31 to 1:3. In addition, two of the three classifiers used showed improved performance over the iterations. It is conceivable that our approach can be applied to various other health surveillance studies that use machine learning-based classification of imbalanced Twitter data.
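The general shape of such a bootstrap loop can be sketched as follows. The abstract does not specify the classifier, the filtering rule, or the annotation procedure, so everything here is an assumption made for illustration: the keyword-based `train_keyword_filter` stands in for a real machine learning filter, and the tweets, labels, and function names are invented.

```python
def train_keyword_filter(labeled):
    """Toy stand-in for a trained classifier (assumption, not the paper's).

    Returns words seen in positive tweets but never in negative ones.
    """
    pos_words, neg_words = set(), set()
    for text, is_pet in labeled:
        (pos_words if is_pet else neg_words).update(text.lower().split())
    return pos_words - neg_words


def bootstrap(seed_labeled, pool, annotate, iterations=3):
    """Grow a labeled corpus by annotating only tweets the filter lets through."""
    corpus = list(seed_labeled)
    remaining = list(pool)
    for _ in range(iterations):
        keywords = train_keyword_filter(corpus)  # retrain on current corpus
        candidates = [t for t in remaining
                      if keywords & set(t.lower().split())]
        if not candidates:
            break  # the filter finds nothing new to annotate
        # Human annotation happens only for filtered candidates.
        corpus.extend((t, annotate(t)) for t in candidates)
        remaining = [t for t in remaining if t not in candidates]
    return corpus


def imbalance(corpus):
    """Return (number of PETs, number of non-PETs)."""
    positives = sum(1 for _, is_pet in corpus if is_pet)
    return positives, len(corpus) - positives


# Invented data: a tiny seed set, an unlabeled pool, and an annotation oracle.
seed = [
    ("i took fish oil and my joints feel better", True),
    ("buy fish oil online today", False),
]
pool = [
    "my stomach hurts after i took ginseng",
    "huge sale ends today",
    "new study links vitamins to sleep",
    "i feel dizzy since starting ginseng",
]
oracle = {pool[0]: True, pool[1]: False, pool[2]: False, pool[3]: True}
asked = []  # records which tweets needed human annotation


def annotate(text):
    asked.append(text)
    return oracle[text]


corpus = bootstrap(seed, pool, annotate)
print(imbalance(corpus), "annotated:", len(asked), "of", len(pool))
```

On this toy data only the two tweets that pass the filter are sent for annotation, which mirrors the idea of enriching the minority class while skipping most of the irrelevant posts.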
Health surveillance is an important task that tracks events related to human health, and one of its areas is pharmacovigilance. Pharmacovigilance tracks and monitors the safe use of pharmaceutical products, which involves tracking side effects that may be caused by medicines and other health-related drugs. Medical professionals have a difficult time collecting this information. It is anticipated that social media could help to collect these data and track side effects. Twitter data can be used for this task given that users post their personal health-related experiences online. One problem with Twitter data, however, is that they contain a lot of noise, so an approach is needed to remove it. In this paper, several machine learning algorithms, including deep neural nets, are used to build classifiers that can help to detect these Personal Experience Tweets (PETs). Finally, we propose a method called the Deep Gramulator that improves results. Results of the analysis are presented and discussed.
Twitter, as a social media platform, has become an increasingly useful data source for health surveillance studies, and personal health experiences shared on Twitter provide valuable information for such surveillance. Twitter data are known for their irregular language use and informal short texts due to the 140-character limit, and for their noisiness, in that the majority of posts are irrelevant to any particular health surveillance task. These factors pose challenges in identifying personal health experience tweets in Twitter data. In this study, we designed deep neural networks with 3 different architectural configurations, and after training them on a corpus of 8,770 annotated tweets, we used them to predict personal experience tweets in a set of 821 annotated tweets. Our results demonstrated a significant improvement in predicting personal health experience tweets by deep neural networks over conventional classifiers: 37.5% in accuracy, 31.1% in precision, and 53.6% in recall. We believe that our method can be utilized in various health surveillance studies using Twitter as a data source.