2022
DOI: 10.1038/s41598-022-20000-5

Tracking mutational semantics of SARS-CoV-2 genomes

Abstract: Natural language processing (NLP) algorithms process linguistic data in order to discover the associated word semantics and develop models that can describe or even predict the latent meanings of the data. The applications of NLP become multi-fold while dealing with dynamic or temporally evolving datasets (e.g., historical literature). Biological datasets of genome-sequences are interesting since they are sequential as well as dynamic. Here we describe how SARS-CoV-2 genomes and mutations thereof can be proces…

citations
Cited by 4 publications
(2 citation statements)
references
References 41 publications
“…Using NLP techniques, particularly the Word2Vec model for viral embedding, has been recognized in prior research. But while earlier studies were often unimodal and focused on tasks like viral classification or evolution tracking, our method differs by integrating this with other data modalities, offering a more comprehensive view [48][49][50][51]. The merit of this method is evident in its ability to encapsulate the cumulative effects of multiple viral mutations and their relationships, a task that single-modal approaches might find challenging.…”
Section: Discussion (mentioning)
confidence: 99%
“…The 4-week period has been chosen as the most appropriate trade-off that captures the weekly periodicity of data collections/submissions and allows extracting amounts of sequences that grant sufficient statistical significance for the trend measurements. One-month periods are typically employed in works that aim to track variants and provide early warning for them (14, 17, 18). The complete set of amino acid changes associated with genome sequences $M^{\prime} = \{ m \mid \exists s \in S^{\prime} \wedge m \in s.M \}$ is then derived and, for every mutation, levels of prevalence in the first, second, third and fourth week are computed $\left( p_1, p_2, p_3, p_4 \in \left[ 0, 100 \right] \right)$.…”
Section: Construction and Content (mentioning)
confidence: 99%
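
The prevalence computation described in the citation statement above can be illustrated with a minimal sketch. It is not the citing authors' implementation: the function name `weekly_prevalence`, the `(collection_date, mutations)` input shape, and the choice to report 0.0 for a week with no sequences are all assumptions made for illustration. It collects the set of mutations seen in a 4-week window (the $M^{\prime}$ of the statement) and, for each one, the percentage of that week's sequences carrying it ($p_1, \dots, p_4 \in [0, 100]$).

```python
from collections import Counter
from datetime import date


def weekly_prevalence(sequences, start, weeks=4):
    """Return {mutation: (p_1, ..., p_weeks)} with each p in [0, 100].

    sequences: iterable of (collection_date, iterable_of_mutations) pairs
               (hypothetical input shape, assumed for this sketch).
    start:     first day of the window; week k covers days 7k .. 7k+6.
    """
    totals = [0] * weeks                        # sequences seen per week
    counts = [Counter() for _ in range(weeks)]  # mutation counts per week

    for collected, mutations in sequences:
        week = (collected - start).days // 7
        if 0 <= week < weeks:                   # skip sequences outside the window
            totals[week] += 1
            counts[week].update(set(mutations))

    # M': every mutation carried by at least one sequence in the window
    all_mutations = set().union(*counts)

    return {
        m: tuple(
            100.0 * counts[w][m] / totals[w] if totals[w] else 0.0
            for w in range(weeks)
        )
        for m in all_mutations
    }
```

Treating an empty week as 0.0 prevalence is a simplification; a real pipeline would more likely flag such weeks as missing, since the statement stresses that each period needs enough sequences for the trend to be statistically meaningful.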