Background: Tweets can provide broad, real-time perspectives on health and medical diagnoses that can inform disease surveillance in geographic regions. Less is known, however, about how much individuals post about common health conditions or what they post about.
Objective: We sought to collect and analyze tweets from one state about high-prevalence health conditions and to characterize tweet volume and content.
Methods: We collected 408,296,620 tweets originating in Pennsylvania from 2012-2015 and compared the prevalence of 14 common diseases to the frequency of disease mentions on Twitter. We identified and corrected bias induced by variation in disease term specificity and used differential language analysis, a machine learning approach, to determine the content (words and themes) most highly correlated with each disease.
Results: Common disease terms were included in 226,802 tweets. Posts about breast cancer (22.5% of messages, 2.4% prevalence) and diabetes (23.1% of messages, 17.2% prevalence) were overrepresented on Twitter relative to disease prevalence, while hypertension (9.9% of messages, 36.3% prevalence), COPD (0.9% of messages, 8.5% prevalence), and heart disease (7.8% of messages, 19.4% prevalence) were underrepresented. The content of messages also varied by disease. Personal experience messages accounted for 12% of prostate cancer tweets and 24% of asthma tweets. Awareness-themed tweets were more often about breast cancer (23%) than asthma (6%). Tweets about risk factors were more often about heart disease (10%) than lymphoma (2%).
Conclusions: Twitter provides a window into the online visibility of diseases and how the volume of online content about diseases varies by condition. Further, the potential value of tweets lies in the rich content they provide about individuals' perspectives on diseases (e.g., personal experiences, awareness, risk factors), which are not otherwise easily captured through traditional surveys or administrative data.
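
The comparison of message share with disease prevalence reported in the Results can be illustrated with a short Python sketch. It uses only the five pairs of figures quoted in this abstract. The `representation_ratio` index (message share divided by prevalence) is a hypothetical simplification introduced here for illustration. It is not the authors' method, which also corrected for bias from disease term specificity.

```python
# (share of disease-related messages %, prevalence %) as quoted in the Results.
reported = {
    "breast cancer": (22.5, 2.4),
    "diabetes": (23.1, 17.2),
    "hypertension": (9.9, 36.3),
    "COPD": (0.9, 8.5),
    "heart disease": (7.8, 19.4),
}


def representation_ratio(message_share: float, prevalence: float) -> float:
    """Hypothetical index: message share divided by prevalence.

    Values above 1 suggest a disease is discussed more than its prevalence
    alone would predict; values below 1 suggest the opposite.
    """
    return message_share / prevalence


ratios = {
    disease: representation_ratio(share, prevalence)
    for disease, (share, prevalence) in reported.items()
}

# List diseases from most to least represented relative to prevalence.
for disease, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    label = "overrepresented" if ratio > 1 else "underrepresented"
    print(f"{disease}: {ratio:.2f} ({label})")
```

Under this simple index, breast cancer and diabetes fall above 1 and hypertension, COPD, and heart disease fall below 1, which matches the direction of the findings stated in the Results.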