HaploCoV: unsupervised classification and rapid detection of novel emerging variants of SARS-CoV-2

Chiara, Matteo; Horner, David Stephen; Ferrandi, Erika; Gissi, Carmela; Pesole, Graziano

doi:10.1038/s42003-023-04784-4

Cited by 5 publications

(9 citation statements)

References 48 publications

“…Monkeypox genome sequences and associated metadata were accessed through the Nextstrain resource 57, low-quality genome sequences were discarded according to the same criteria used for SARS-CoV-2, and mutations were identified by applying the HaploCoV workflow 27 on a collection of 2,526 high-quality sequences. A total of 4,932 distinct mutations were detected.…”

Section: Results (mentioning)

confidence: 99%

“…The original data included genome sequences in FASTA format and associated metadata (accession ID, collection date, submission date, Pangolin lineage, and collection location). These were processed by the HaploCoV pipeline 28 to derive a large table with the list of mutations and matched metadata for every genome sequence.…”

Section: Methods (mentioning)

confidence: 99%

“…As the COVID-19 pandemic progressed, research interests shifted toward the study of mutational signatures and variants associated with increased transmission rates and reduced antigenicity, and possibly hampering testing, treatment, and vaccine development [21][22][23]. A number of methods were proposed to allow automatic early detection of variants [24][25][26][27][28]. Instead, interest in the automatic identification of recombination in SARS-CoV-2 started at a later stage.…”

Section: Main (mentioning)

confidence: 99%

“…We considered 15,271,031 SARS-CoV-2 genomes, downloaded from the GISAID database 20 on April 1st, 2023. Genome sequences were aligned to the SARS-CoV-2 reference genome and nucleotide mutations were identified by the HaploCoV pipeline 28. To mitigate the impacts of sequencing and assembly errors, genome sequences of uncertain/low quality were excluded (i.e., records associated with low coverage, percentage of unknown bases ≥ 2%, unknown length, and incomplete metadata).…”

Section: Main (mentioning)

confidence: 99%
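The exclusion criteria quoted in the last statement above (unknown bases ≥ 2%, unknown length, incomplete metadata) can be illustrated with a minimal Python sketch. This is not the HaploCoV implementation: the `GenomeRecord` class, its field names, and the helper functions are hypothetical, "unknown bases" is assumed to mean any character other than A, C, G, or T, and the "low coverage" criterion is not modelled because the quoted text gives no threshold for it.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the quoted statement: records with >= 2% unknown bases are excluded.
MAX_UNKNOWN_FRACTION = 0.02


@dataclass
class GenomeRecord:
    """Hypothetical record layout; field names are illustrative only."""
    accession: str
    sequence: str
    collection_date: Optional[str]
    location: Optional[str]


def unknown_fraction(sequence: str) -> float:
    """Fraction of characters that are not A, C, G, or T (assumed definition of 'unknown')."""
    if not sequence:
        return 1.0
    unknown = sum(1 for base in sequence.upper() if base not in "ACGT")
    return unknown / len(sequence)


def passes_quality(record: GenomeRecord) -> bool:
    """Apply the three criteria that the quoted text specifies concretely."""
    if not record.sequence:
        return False  # unknown length
    if not record.collection_date or not record.location:
        return False  # incomplete metadata
    return unknown_fraction(record.sequence) < MAX_UNKNOWN_FRACTION


records = [
    GenomeRecord("g1", "ACGT" * 25, "2023-01-01", "Italy"),
    GenomeRecord("g2", "ACGT" * 24 + "ACNN", "2023-01-01", "Italy"),
]
kept = [r.accession for r in records if passes_quality(r)]
```

Here the second record has 2 unknown bases out of 100, which reaches the 2% threshold, so only the first record is kept.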

Data-driven recombination detection in viral genomes

Alfonsi, Bernasconi, Chiara, et al. (2023). Preprint. Self-citation.

Recombination is a key molecular mechanism for the evolution and adaptation of viruses. The first recombinant SARS-CoV-2 genomes were recognized in 2021; as of today, more than seventy SARS-CoV-2 lineages are designated as recombinant. In the wake of the COVID-19 pandemic, several methods for detecting recombination in SARS-CoV-2 have been proposed; however, none could faithfully reproduce manual analyses by experts in the field. We hereby present RecombinHunt, a novel, automated method for the identification of recombinant genomes purely based on a data-driven approach. RecombinHunt compares favorably with other state-of-the-art methods and recognizes recombinant SARS-CoV-2 genomes (or lineages) with one or two breakpoints with high accuracy, within reduced turn-around times and small discrepancies with respect to the expert manually-curated standard nomenclature. Strikingly, applied to the complete collection of viral sequences from the recent monkeypox epidemic, RecombinHunt identifies recombinant viral genomes in high concordance with manually curated analyses by experts, suggesting that our approach is robust and can be applied to any epidemic/pandemic virus. Although RecombinHunt does not substitute manual expert curation based on phylogenetic analysis, we believe that our method represents a breakthrough for the detection of recombinant viral lineages in pandemic/epidemic scenarios.

Data-driven recombination detection in viral genomes

Alfonsi, Bernasconi, Chiara, et al. (2023). Preprint. Self-citation.

“…As the COVID-19 pandemic progressed, research interests shifted toward the study of mutational signatures and variants associated with increased transmission rates and reduced antigenicity, and possibly hampering testing, treatment, and vaccine development 20–22. A number of methods were proposed to allow automatic early detection of variants 23–27. Instead, interest in the automatic identification of recombination in SARS-CoV-2 started at a later stage.…”

Section: Introduction (mentioning)

confidence: 99%

Data-driven recombination detection in viral genomes

Alfonsi, Bernasconi, Chiara, et al. (2024). Nat Commun. Self-citation.

Recombination is a key molecular mechanism for the evolution and adaptation of viruses. The first recombinant SARS-CoV-2 genomes were recognized in 2021; as of today, more than ninety SARS-CoV-2 lineages are designated as recombinant. In the wake of the COVID-19 pandemic, several methods for detecting recombination in SARS-CoV-2 have been proposed; however, none could faithfully confirm manual analyses by experts in the field. We hereby present RecombinHunt, an original data-driven method for the identification of recombinant genomes, capable of recognizing recombinant SARS-CoV-2 genomes (or lineages) with one or two breakpoints with high accuracy and within reduced turn-around times. RecombinHunt shows high specificity and sensitivity, compares favorably with other state-of-the-art methods, and faithfully confirms manual analyses by experts. RecombinHunt identifies recombinant viral genomes from the recent monkeypox epidemic in high concordance with manually curated analyses by experts, suggesting that our approach is robust and can be applied to any epidemic/pandemic virus.

AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data

Silva, Pinho, Pratas (2024). GigaScience.

Background: Most viral genome sequences generated during the latest pandemic have presented new challenges for computational analysis. Analyzing millions of viral genomes in multi-FASTA format is computationally demanding, especially when using alignment-based methods. Most existing methods are not designed to handle such large datasets, often requiring the analysis to be divided into smaller parts to obtain results using available computational resources. Findings: We introduce AltaiR, a toolkit for analyzing multiple sequences in multi-FASTA format using exclusively alignment-free methodologies. AltaiR enables the identification of singularity and similarity patterns within sequences and computes static and temporal dynamics without restrictions on the number or size of input sequences. It automatically filters low-quality, biased, or deviant data. We demonstrate AltaiR’s capabilities by analyzing more than 1.5 million full severe acute respiratory syndrome coronavirus 2 sequences, revealing interesting observations regarding viral genome characteristics over time, such as shifts in nucleotide composition, decreases in average Kolmogorov sequence complexity, and the evolution of the smallest sequences not found in the human host. Conclusions: AltaiR can identify temporal characteristics and trends in large numbers of sequences, making it ideal for scenarios involving endemic or epidemic outbreaks with vast amounts of available sequence data. Implemented in C with multithreading and methodological optimizations, AltaiR is computationally efficient, flexible, and dependency-free. It accepts any sequence in FASTA format, including amino acid sequences. The complete toolkit is freely available at https://github.com/cobilab/altair.
