BACKGROUND Each year, influenza affects 3 to 5 million people and causes 290,000 to 650,000 fatalities worldwide. To reduce these fatalities, several countries have established influenza surveillance systems to collect early-warning data. However, proper and timely warnings are hindered by a 1- to 2-week delay between actual disease outbreaks and the publication of surveillance data. To avoid the delay of traditional monitoring methods, novel methods have been proposed for influenza surveillance and prediction using real-time internet data (such as search queries, microblogging, and news). Some currently popular approaches extract online data and use machine learning to predict influenza occurrences as a classification task. However, many of these methods extract training data subjectively, which makes it difficult to capture the latent characteristics of the data correctly. There is a critical need for new approaches that extract training data in a way that reflects these latent characteristics.
OBJECTIVE In this paper, we propose an effective training data extraction method that reflects the hidden features and improves performance by filtering and selecting only the keywords related to influenza before prediction.
METHODS Word embeddings provide a distributed representation of words by encoding the hidden relationships between various tokens. We enhance the word embeddings by selecting keywords related to the influenza outbreak and sorting the extracted keywords by their Pearson correlation coefficient (PCC) with the influenza outbreak. The keyword extraction process is followed by a predictive model based on long short-term memory (LSTM) that predicts the influenza outbreak. To assess the performance of the proposed predictive model, we use and compare a variety of word embeddings.
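The PCC-based sorting step can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names (`pearson`, `rank_keywords`), the toy weekly counts, the `top_k` cutoff, and the choice to sort by descending signed correlation are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant series is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_keywords(keyword_counts, flu_series, top_k=None):
    """Sort candidate keywords by PCC with the influenza series, highest first.

    keyword_counts maps each keyword to its weekly frequency series
    (hypothetical input format); flu_series is the weekly influenza signal.
    """
    scored = [(kw, pearson(counts, flu_series))
              for kw, counts in keyword_counts.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy data, invented for illustration only.
flu_cases = [10, 20, 40, 80, 40, 20]
weekly_counts = {
    "fever": [1, 2, 4, 8, 4, 2],    # proportional to flu_cases
    "cough": [3, 4, 6, 9, 7, 5],    # positively related
    "holiday": [8, 4, 2, 1, 2, 4],  # inversely related
}

ranked = rank_keywords(weekly_counts, flu_cases, top_k=2)
for keyword, score in ranked:
    print(f"{keyword}: {score:.3f}")
```

In this sketch, only the highest-ranked keywords would be passed on as training features to the downstream predictor, which mirrors the idea of selecting fewer but more relevant keywords.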
RESULTS Word embeddings without our proposed sorting process showed a prediction accuracy of 0.8705 when 50.2 keywords were selected on average. In contrast, word embeddings with our proposed sorting process showed a prediction accuracy of 0.8868 and a 12.6% improvement in prediction accuracy, even though a smaller amount of training data was selected, with only 20.6 keywords on average.
CONCLUSIONS The sorting process empowers the embedding process, which improves the feature extraction process because it acts as a knowledge base for the prediction component. The model outperforms other current approaches that use flat extraction before prediction.