Accurate and timely monitoring of the evolution of SARS-CoV-2 is crucial for identifying and tracking potentially more transmissible/virulent viral variants, and for implementing mitigation strategies to limit their spread. Here we introduce HaploCoV, a novel software framework that enables the exploration of SARS-CoV-2 genomic diversity through space and time, to identify novel emerging viral variants and prioritize variants of potential epidemiological interest in a rapid and unsupervised manner. HaploCoV can integrate with any classification/nomenclature and incorporates an effective scoring system for the prioritization of SARS-CoV-2 variants. By performing retrospective analyses of more than 11.5 M genome sequences, we show that HaploCoV achieves high levels of accuracy and reproducibility and identifies the large majority of epidemiologically relevant viral variants (as flagged by international health authorities) automatically and with rapid turnaround times. Our results highlight the importance of strategies based on the systematic analysis and integration of regional data for the rapid identification of novel, emerging variants of SARS-CoV-2. We believe that the approach outlined in this study will contribute to relevant advances in current and future genomic surveillance methods.
Accurate and timely monitoring of emerging genomic diversity is crucial for limiting the spread of potentially more transmissible/pathogenic strains of SARS-CoV-2. At the time of writing, over 1.8M distinct viral genome sequences have been made publicly available, and a sophisticated nomenclature system based on phylogenetic evidence and expert manual curation has allowed the relatively rapid classification of emerging lineages of potential concern. Here, we propose a complementary approach that integrates fine-grained spatiotemporal estimates of allele frequency with unsupervised clustering of viral haplotypes, and demonstrate that multiple highly frequent genetic variants, arising within large and/or rapidly expanding SARS-CoV-2 lineages, have highly biased geographic distributions and are not adequately captured by current SARS-CoV-2 nomenclature standards. Our results advocate a partial revision of the methods currently used to track SARS-CoV-2 genomic diversity and highlight the importance of strategies based on the systematic analysis and integration of regional data. We provide a complementary, completely automated and reproducible framework for mapping genetic diversity over time and across different geographic regions, and for prioritizing virus variants of potential concern. We believe that the approach outlined in this study will contribute to relevant advances in current genomic surveillance methods.
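To make the idea of a spatiotemporal allele-frequency estimate concrete, the following is a minimal sketch, not the authors' implementation. It assumes each genome is reduced to a region, a collection date, and a set of mutations relative to the reference, and it bins genomes by region and ISO week. The record layout, the function name `allele_frequencies`, and the example records are all hypothetical, and the unsupervised clustering of haplotypes is not covered.

```python
from collections import defaultdict
from datetime import date

# Hypothetical input records: (region, collection date, mutations carried by the genome).
# The mutation labels are illustrative only.
genomes = [
    ("Italy", date(2021, 1, 4), {"A23403G", "C14408T"}),
    ("Italy", date(2021, 1, 6), {"A23403G"}),
    ("Italy", date(2021, 1, 12), {"A23403G", "G28881A"}),
    ("Brazil", date(2021, 1, 5), {"A23403G", "G28881A"}),
]


def allele_frequencies(records):
    """Return {(region, iso_year, iso_week): {mutation: frequency}}.

    The frequency of a mutation is the fraction of genomes sampled in that
    region and ISO week that carry it (an assumed, simplified definition).
    """
    totals = defaultdict(int)                        # genomes per (region, year, week)
    counts = defaultdict(lambda: defaultdict(int))   # carriers per bin and mutation
    for region, collected, mutations in records:
        iso = collected.isocalendar()
        key = (region, iso[0], iso[1])
        totals[key] += 1
        for mutation in mutations:
            counts[key][mutation] += 1
    return {
        key: {m: n / total for m, n in counts[key].items()}
        for key, total in totals.items()
    }


freqs = allele_frequencies(genomes)
for key in sorted(freqs):
    print(key, dict(sorted(freqs[key].items())))
```

Comparing the resulting tables between regions and across consecutive weeks is one simple way to spot mutations whose frequency is geographically biased or rising quickly, which is the kind of signal the abstract describes.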