Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences

Ali, Sarwan

doi:10.1109/bigdata52589.2021.9671848

Cited by 53 publications

(44 citation statements)

References 54 publications

“…In this section, we present our results for PWM2Vec and compare its performance with the baseline one-hot embedding (OHE) and the more recent k-mer-based embedding approach, which has shown to be an improvement over OHE [33,34]. For classification, we also show the results for the feature selection method (ridge regression) for all embedding approaches.…”

Section: Results (mentioning)

confidence: 99%

“…This section proposes an approach, PWM2Vec, to generate a fixed-length numerical feature embedding from coronavirus spike sequences for host specification. We also discuss the baseline approaches, specifically one-hot embedding (OHE) [32,34] and k-mer-based feature embedding [33,34]. We perform feature selection using ridge regression [70] on the resulting embedding before applying machine learning (ML) algorithms.…”

Section: Proposed Approach (mentioning)

confidence: 99%

“…In our experiments to generate k-mer-based frequency vectors, we used k = 3 (as done in [33,34]). After generating the k-mers, we created a feature vector Φ (a frequency vector), which contains the frequency (count) of each k-mer occurring in the sequence [33,34]. Given some sequence σ with alphabet Σ, the length of feature vector Φ_k(σ) will be |Σ|^k.…”

Section: K-mer-based Frequency Vectors (mentioning)

confidence: 99%
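
The k-mer frequency vector described in the excerpt above can be sketched roughly as follows. This is an illustrative Python sketch, not the cited authors' code: the function name `kmer_frequency_vector` is hypothetical, the alphabet Σ is assumed to be the 20 standard amino-acid letters, and k-mers containing any other character are simply skipped (a choice made here for illustration). With k = 3 the vector has 20^3 = 8000 entries.

```python
from itertools import product

# Assumption: the alphabet is the 20 standard amino-acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequency_vector(sequence, alphabet=AMINO_ACIDS, k=3):
    """Count every length-k substring of `sequence` over `alphabet`.

    Returns a list of length len(alphabet) ** k, where position i holds
    the count of the i-th k-mer in lexicographic order of the alphabet.
    """
    # Enumerate all possible k-mers so the vector has a fixed length.
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(all_kmers)}

    vector = [0] * len(all_kmers)
    # Slide a window of width k over the sequence, one position at a time.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        position = index.get(kmer)
        if position is not None:  # skip k-mers with unknown characters
            vector[position] += 1
    return vector


if __name__ == "__main__":
    toy = "MFVFLVLLPLV"  # a short made-up fragment, for illustration only
    phi = kmer_frequency_vector(toy)
    print(len(phi), sum(phi))
```

A sequence of length n contributes n - k + 1 overlapping k-mers, so the entries of the vector sum to that value when every character is in the alphabet.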

“…This behavior can cause parallelism and multicollinearity (when multiple features are correlated with each other) in high dimensions. The authors of [33,34] used the coronavirus spike sequences to classify different variants of COVID-19 using k-mer-based frequency vectors. Researchers have performed clustering on the COVID-19 spike sequences using the same k-mer-based frequency vector generation approach [35,36].…”

Section: Introduction (mentioning)

confidence: 99%

PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences

Ali, Bello, Chourasia, et al. 2022. Biology.

The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic—an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime—in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. 
Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.

“…Fast and efficient solutions to the clade assignment problem would help in tracking current and evolving strains and it is crucial for the surveillance of the pathogen. This classification problem has been attacked with machine learning approaches [3,4,5] using the Spike protein amino acid sequence to drive the classification step.…”

Section: Introduction (mentioning)

confidence: 99%

Accurate and Fast Clade Assignment via Deep Learning and Frequency Chaos Game Representation

Cartes, Anand, Ciccolella, et al. 2022. Preprint.

Background: Since the beginning of the COVID-19 pandemic there has been an explosion of sequencing of the SARS-CoV-2 virus, making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade. Results: In this paper, we leverage the Frequency Chaos Game Representation (FCGR) and Convolutional Neural Networks (CNNs) to develop an original method that learns how to classify genome sequences, which we implement in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieves a 96.29% overall accuracy, while a similar tool, Covidex, obtained a 77.12% overall accuracy. As far as we know, our method is the first to use Deep Learning and FCGR for intra-species classification. Furthermore, by using some feature importance methods, CouGaR-g makes it possible to identify k-mers that match SARS-CoV-2 marker variants. Conclusions: By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on Random Forest) for clade assignment of SARS-CoV-2 genome sequences, also thanks to our training on a much larger dataset, with comparable running times. Our method implemented in CouGaR-g is able to detect k-mers that capture relevant biological information that distinguishes the clades, known as marker variants. Availability: The trained models can be tested online by providing a FASTA file (with one or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.

A k-mer Based Approach for SARS-CoV-2 Variant Identification

Ali, Sahoo, Ullah, et al. 2021. Lecture Notes in Computer Science.
