Global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape, increased transmissibility, or increased pathogenicity. Early prediction of the emergence of new strains with these features is critical for pandemic preparedness. We present Strainflow, a supervised and causally predictive model that uses unsupervised latent space features of SARS-CoV-2 genome sequences. Strainflow was trained and validated on 0.9 million sequences from December 2019 to June 2021, and the frozen model was prospectively validated from July 2021 to December 2021. Strainflow captured the rise in cases 2 months ahead of the Delta and Omicron surges in most countries, including predicting a surge in India as early as the beginning of November 2021. Entropy analysis of Strainflow's unsupervised embeddings reveals the explore-exploit cycles in genomic feature space, adding interpretability to the deep-learning-based model. We also conducted a codon-level analysis of the model to assess the interpretability and biological validity of its unsupervised features. Strainflow is openly available as an interactive web application for prospective genomic surveillance of COVID-19 across the globe.
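The abstract does not say how the entropy of the embeddings was computed. As a minimal sketch of one common choice, the snippet below estimates the Shannon entropy of a set of embedding values from a fixed-width histogram. The function name `shannon_entropy`, the `bins` parameter, and the toy values are assumptions made for illustration, not part of Strainflow.

```python
import math
from collections import Counter


def shannon_entropy(values, bins=10):
    """Histogram-based Shannon entropy (in bits) of a list of numbers.

    Hypothetical helper: the binning scheme is an assumption, not the
    procedure used by the Strainflow authors.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single occupied bin, zero entropy.
        return 0.0
    width = (hi - lo) / bins
    # Assign each value to a bin; the maximum falls into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy embedding values for two hypothetical time windows.
concentrated = [0.0, 0.01, 0.02, 0.01, 0.0, 0.02, 0.01, 1.0]
spread_out = [0.0, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9, 1.0]

print(shannon_entropy(concentrated))
print(shannon_entropy(spread_out))
```

Under this reading, a window whose values spread across many bins has higher entropy than one whose values concentrate in a few bins. Tracking that quantity over time is one way an explore-exploit pattern could show up.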
Global efforts to control COVID-19 are threatened by the rapid emergence of novel variants that may display undesirable characteristics such as immune escape or increased pathogenicity. Current approaches to genomic surveillance do not allow early prediction of emerging variations. Here, we derive Dimensions of Concern (DoCs) in the latent space of SARS-CoV-2 mutations and demonstrate their potential to provide a lead time for predicting the increase in new cases in 9 countries across the globe. We learned unsupervised word embeddings from 309,060 spike protein coding sequences deposited in the GISAID database until April 2021. We discovered that "blips" in the latent dimensions of the embeddings are associated with mutations. We modeled the temporal occurrence of blips and their relationship with the number of new cases in the following months for these countries. Certain dimensions demonstrated a consistent leading relationship between the occurrence of blips and the number of new cases in the following months, and were therefore labeled as potential DoCs. We validated the predictive importance of DoCs by performing Random Forest-based feature selection and modeling in a temporally split training, validation, and testing regime. Twelve dimensions achieved statistical significance and yielded an R-squared of 37% for predicting the number of new cases in the following month. Biological exploration of DoCs revealed that dimensions 3 and 12 capture the 3-mers CGG, ACG, and CAC, which are associated with the known variants L452R, K417T, and Q677H, respectively. Learning and tracking DoCs is extensible to related challenges such as pandemic preparedness, immune escape, pathogenicity modeling, and antimicrobial resistance.
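To make the idea of treating a coding sequence as "words" concrete, here is a minimal sketch that splits a nucleotide string into 3-mers, the kind of token a word-embedding model could be trained on. The abstract does not state whether the 3-mers were overlapping or codon-aligned, so the `stride` parameter, the function name `kmer_tokens`, and the toy sequence are assumptions.

```python
from collections import Counter


def kmer_tokens(sequence, k=3, stride=3):
    """Split a nucleotide sequence into k-mer "words".

    stride=3 gives non-overlapping, codon-aligned 3-mers;
    stride=1 gives overlapping 3-mers. Which one the authors
    used is not stated in the abstract (assumption).
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Toy sequence, not a real spike fragment.
toy = "atgcggacgcac"

codons = kmer_tokens(toy)
overlapping = kmer_tokens(toy, stride=1)

print(codons)
print(Counter(overlapping).most_common(3))
```

Each sequence then becomes a "sentence" of such tokens, which is the input shape that standard word-embedding methods expect.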
Social contact mixing patterns are critical to the transmission of communicable diseases and have been employed to model disease outbreaks including COVID-19. Nonetheless, there is a paucity of studies on contact mixing in low- and middle-income countries such as India. Furthermore, mathematical models of disease outbreaks do not account for the temporal nature of social contacts. We conducted a longitudinal study of social contacts in rural north India across three seasons and analysed the temporal differences in contact patterns. A contact diary survey was performed across three seasons from October 2015-16, in which participants were queried on the number, duration, and characteristics of contacts that occurred on the previous day. A total of 8,421 responses from 3,052 respondents (49% females) recorded characteristics of 180,073 contacts. Respondents reported a significantly higher number and duration of contacts in the winter, followed by the summer and the monsoon season (Nemenyi post-hoc, p<0.001). Participants aged 0-9 years and 10-19 years reported the highest median number of contacts (16 [IQR 12-21] and 17 [IQR 13-24], respectively) and were found to have the highest node centrality in the social network of the region (PageRank = 0.20 and 0.17). Employed males across all age groups were found to have a higher number of contacts than unemployed males (negative binomial regression: rate ratio 1.18, 95% CI: 1.05-1.31). A large proportion (>80%) of contacts reported in schools or on public transport involved physical contact. To the best of our knowledge, our study is the first from India to show that contact mixing patterns vary by the time of the year, and it has useful implications for pandemic control. Our results can be used to parameterize more accurate mathematical models for predicting epidemiological trends of infections in rural India.
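The abstract reports PageRank centrality per age group but not how it was computed. As a rough illustration, the sketch below runs PageRank by power iteration on a small weighted contact matrix between age groups. The matrix values, the group labels, and the damping factor of 0.85 are invented for this example and are not the study's data or settings.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank by power iteration on a weighted adjacency matrix.

    weights[i][j] is the contact weight from group i to group j.
    Rows are normalised into transition probabilities; a row of
    zeros is treated as linking to every group equally.
    """
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                if row_sums[i] == 0:
                    share = 1.0 / n
                else:
                    share = weights[i][j] / row_sums[i]
                new[j] += damping * rank[i] * share
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank


# Hypothetical mean daily contacts between three age groups.
groups = ["0-9", "10-19", "20+"]
contacts = [
    [10, 4, 3],
    [4, 12, 3],
    [3, 3, 2],
]

ranks = pagerank(contacts)
for group, score in zip(groups, ranks):
    print(group, round(score, 3))
```

In this toy matrix the 10-19 group has the largest total contact weight and ends up with the highest score. This mirrors the kind of comparison the abstract draws, without reproducing its numbers.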