Background
Because official health statistics for lifestyle diseases are slow to produce, researchers have explored Web search data as a proxy for lifestyle disease surveillance. Existing studies, however, suffer from at least one of the following issues: ad hoc keyword selection, overfitting, insufficient predictive evaluation, lack of generalization, or failure to compare against trivial baselines.
Objective
The aims of this study were to (1) apply a corrective approach that improves on previous methods; (2) study the key limitations of using Google Trends for lifestyle disease surveillance; and (3) test the generalizability of our methodology to countries beyond the United States.
Methods
Prevalence rates were collected for each of the target variables (diabetes, obesity, and exercise). After a rigorous keyword selection process, the corresponding Google Trends data were collected and denormalized to form spatio-temporal indices. L1-regularized regression models were trained to predict prevalence rates from the denormalized Google Trends indices. Models were tested on a held-out set and compared against baselines from the literature as well as a trivial "last year equals this year" baseline. A similar analysis was done using a multivariate spatio-temporal model in which the previous year’s prevalence was included as a covariate. This model was then modified to create a time-lagged regression analysis framework. Finally, a hierarchical time-lagged multivariate spatio-temporal model was created to account for subnational trends in the data. The model trained on US data was then applied to Canada in a transfer learning framework.
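The core comparison described above (an L1-regularized regression evaluated against the "last year equals this year" baseline on a held-out year) can be sketched as follows. This is a minimal illustration, not the study's implementation: the panel is synthetic, the feature layout, the `alpha` value, and all function names are assumptions, the coordinate-descent solver is one common way to fit an L1-penalized model and is not necessarily the one the authors used, and the denormalization step is not reproduced.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict(w, b, features):
    """Linear prediction: intercept plus weighted sum of features."""
    return b + sum(wk * xk for wk, xk in zip(w, features))


def fit_lasso(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for xi, yi in zip(X, y):
                partial = yi - predict(w, b, xi) + w[j] * xi[j]
                rho += xi[j] * partial
                z += xi[j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
        # Unpenalized intercept: mean residual given the current weights.
        b = sum(yi - predict(w, 0.0, xi) for xi, yi in zip(X, y)) / n
    return w, b


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def make_toy_panel(n_regions=20, n_years=6, seed=0):
    """Synthetic region-by-year panel; every number here is made up."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n_regions):
        level = rng.uniform(8.0, 12.0)  # hypothetical regional prevalence (%)
        rows = []
        for t in range(n_years):
            prevalence = level + 0.2 * t + rng.gauss(0.0, 0.3)
            features = [
                (prevalence - 10.0) + rng.gauss(0.0, 0.5),  # informative index
                rng.gauss(0.0, 1.0),                        # noise index
                rng.gauss(0.0, 1.0),                        # noise index
            ]
            rows.append((features, prevalence))
        panel.append(rows)
    return panel


def run_demo():
    panel = make_toy_panel()
    # Train on all years except the last; hold out the final year.
    X_train = [f for rows in panel for f, _ in rows[:-1]]
    y_train = [p for rows in panel for _, p in rows[:-1]]
    X_test = [rows[-1][0] for rows in panel]
    y_test = [rows[-1][1] for rows in panel]
    # Trivial baseline: predict this year's prevalence with last year's value.
    last_year = [rows[-2][1] for rows in panel]
    w, b = fit_lasso(X_train, y_train, alpha=0.05)
    model_mae = mae(y_test, [predict(w, b, f) for f in X_test])
    baseline_mae = mae(y_test, last_year)
    return w, model_mae, baseline_mae


if __name__ == "__main__":
    weights, model_mae, baseline_mae = run_demo()
    print("weights:", weights)
    print("model MAE:", model_mae, "baseline MAE:", baseline_mae)
```

The toy numbers carry no meaning. The point of the sketch is the evaluation design: any search-based model is only useful if its held-out MAE is lower than that of the trivial baseline.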
Results
In the US context, our proposed models outperformed both prior work and the trivial baselines. In terms of mean absolute error (MAE), the best of our proposed models yielded a 24% improvement for diabetes (from 0.72 to 0.55; P<.001), an 18% improvement for obesity (from 1.20 to 0.99; P=.001), and a 34% improvement for exercise (from 2.89 to 1.95; P<.001). Our proposed cross-country transfer learning framework also showed promising results, with average Spearman and Pearson correlations of 0.70 for diabetes and of 0.90 and 0.91, respectively, for obesity.
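The transfer results are reported as Spearman and Pearson correlations between predicted and observed prevalence. The following minimal sketch shows how the two coefficients are computed. The function names and the example values are hypothetical, and Spearman is implemented here as Pearson on average ranks, which is the standard definition but not necessarily the exact routine the authors used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical predicted vs observed prevalence (%) for five regions.
    predicted = [7.9, 8.4, 9.1, 9.0, 10.2]
    observed = [7.5, 8.8, 9.3, 8.9, 10.6]
    print("Pearson:", pearson(predicted, observed))
    print("Spearman:", spearman(predicted, observed))
```

Pearson measures how linear the relationship is, whereas Spearman only asks whether regions are ranked in the right order. The rank-based measure is the more forgiving one when a model trained in one country is systematically offset in another.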
Conclusions
Although our proposed models beat the baselines, we find the modeling of lifestyle diseases to be a challenging problem, one that requires an abundance of data as well as creative modeling strategies. Overall, this study shows low-to-moderate validity of Google Trends for lifestyle disease surveillance, even when novel corrective approaches, including the proposed denormalization scheme, are applied. We consider qualitative analyses to be a more practical use of Google Trends in this context. For quantitative analyses, Google Trends is most useful in transfer learning, where low-resource countries could benefit from proxy models trained in high-resource countries.