A k-mer Based Approach for SARS-CoV-2 Variant Identification

Ali, Sarwan; Sahoo, Bikram; Ullah, Naimat; Zelikovskiy, Alexander; Patterson, Murray; Khan, Imdadullah

doi:10.1007/978-3-030-91415-8_14

Cited by 46 publications

(61 citation statements)

References 30 publications

“…2) We show that our method is scalable on larger datasets by using ≈2.5 million spike sequences. 3) We prove from the results that the machine learning models used in [27]- [29] are not scalable on these larger datasets. This robust checking helps us to analyze the machine learning models in detail in terms of their appropriateness for SARS-CoV-2 spike sequences.…”

Section: Introduction (mentioning)

confidence: 93%

“…Authors in [29] propose a one-hot encoding based approach to classify different coronavirus hosts using the spike portion of the virus rather than the entire sequence, obtaining near-optimal prediction accuracy. Ali et al in [27] perform classification of different variants of the human SARS-CoV-2. Although they were successful in achieving higher accuracy than in [29], the kernel method used in their approach, however, is not scalable to the size of the data we use in this study.…”

Section: Literature Review (mentioning)

confidence: 99%

“…While dealing with Big Data, it is important to analyze the trade-off between the prediction accuracy and the runtime [55]. Although Ali et al, in [27] use the kernel method for spike sequence classification, since the kernel computation is, however, expensive in terms of time and space, their approach is only a proof of concept, and not feasible in a real-world scenario.…”

Section: Literature Review (mentioning)

confidence: 99%

“…The purpose of using smaller training dataset is to show how much performance gain we can achieve while using minimal training data. Note that our data split and preprocessing follow those of [27].…”

Section: A Experimental Setup (mentioning)

confidence: 99%

“…Since the spike region is sufficient to characterize most of the important features of a viral sample, yet is much smaller in length, as depicted in Figure 1, we focus on an embedding approach tailored to the spike region of the sequences. Previously, some efforts have been done to perform classification and clustering of SARS-CoV-2 spike sequences [27]- [29]. However, those methods are not scalable to the amount of data we use in this study.…”

Section: Introduction (mentioning)

confidence: 99%

Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences

Ali

2021

2021 IEEE International Conference on Big Data (Big Data)

Self Cite

With the rapid global spread of COVID-19, more and more data related to this virus is becoming available, including genomic sequence data. The total number of genomic sequences that are publicly available on platforms such as GISAID is currently several million, and is increasing with every day. The availability of such Big Data creates a new opportunity for researchers to study this virus in detail. This is particularly important with all of the dynamics of the COVID-19 variants which emerge and circulate. This rich data source will give us insights on the best ways to perform genomic surveillance for this and future pandemic threats, with the ultimate goal of mitigating or eliminating such threats. Analyzing and processing the several million genomic sequences is a challenging task. Although traditional methods for sequence classification are proven to be effective, they are not designed to deal with these specific types of genomic sequences. Moreover, most of the existing methods also face the issue of scalability. Previous studies which were tailored to coronavirus genomic data proposed to use spike sequences (corresponding to a subsequence of the genome), rather than using the complete genomic sequence, to perform different machine learning (ML) tasks such as classification and clustering. However, those methods suffer from scalability issues. In this paper, we propose an approach called Spike2Vec, an efficient and scalable feature vector representation for each spike sequence that can be used for downstream ML tasks. Through experiments, we show that Spike2Vec is not only scalable on several million spike sequences, but also outperforms the baseline models in terms of prediction accuracy, F1 score, etc.
Since this type of study on such huge numbers of spike sequences has not been done before (to the best of our knowledge), we believe that it will open new doors for researchers to use this data and perform different tasks to unfold new information that was not available before. We also use information gain (IG) to compute the importance of each amino acid in the spike sequence. The amino acids with higher IG values tend to be the same as many reported by the USA based Centers for Disease Control and Prevention (CDC) for different variants.
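
The information gain step mentioned in the abstract can be illustrated with a minimal sketch. It assumes aligned, equal-length spike sequences and the standard definition IG = H(variant) - H(variant | amino acid at a position); the function names, toy sequences, and variant labels below are hypothetical and are not taken from the paper, whose exact procedure may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(sequences, labels, position):
    """IG of the amino acid at `position` with respect to the variant label.

    Assumes all sequences are aligned and long enough to index `position`.
    """
    # Group variant labels by the amino acid observed at this position.
    groups = {}
    for seq, label in zip(sequences, labels):
        groups.setdefault(seq[position], []).append(label)

    total = len(labels)
    # Conditional entropy: weighted average entropy within each group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four aligned 4-residue fragments, two variants.
sequences = ["MFVN", "MFVK", "MLVN", "MLVK"]
labels = ["Alpha", "Alpha", "Delta", "Delta"]

ig = [information_gain(sequences, labels, i) for i in range(4)]
```

In this toy example only the second position separates the two labels, so it receives the highest score; positions that are constant, or that vary independently of the label, score zero.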
