Global sequencing of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has enabled researchers to monitor the genetic information of new pathogenic variants. However, the integrity and accuracy of most genomic information cannot be guaranteed because of limited sample quality and sequencing technology, which limits researchers' ability to track the latest changes in the virus genome. In this study, we aimed to characterize the balance between diversity and conservation in complete SARS-CoV-2 genomes worldwide since the outbreak. To this end, we collected 5,966,490 genome sequences from various parts of the world, excluding those that were incomplete or contained unknown/degenerate bases. Our methodology included comparing sequences using BLASTN and BLASTP, analyzing RNA secondary structure, and calculating entropy. Our findings provide insights into the characteristics of SARS-CoV-2 and potential mechanisms of pathogenesis. We found that uracil was the most abundant base across various coronaviruses and that cytosine-to-uracil mutations were the most frequent of all point mutations. The consistency in the front part of the genome (1–26,599 nt) was significantly higher than that in the back part (26,600–29,903 nt). For most genes, similar consistency characteristics were also observed in the corresponding protein families.
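The entropy calculation mentioned above is not specified further in this summary. As a minimal sketch, assuming per-position Shannon entropy over an alignment of equal-length sequences, it could look like the following. The function names `column_entropy` and `alignment_entropy` are hypothetical, and treating T as U and ignoring gaps and degenerate symbols are illustrative assumptions, not the study's documented procedure.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    Assumption: T is treated as U, and gaps or degenerate symbols are ignored.
    """
    bases = column.upper().replace("T", "U")
    counts = Counter(b for b in bases if b in "ACGU")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(sequences):
    """Per-position entropy for a list of aligned, equal-length sequences."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [
        column_entropy("".join(s[i] for s in sequences))
        for i in range(length)
    ]
```

Under this definition, a fully conserved position has entropy 0 and a position split evenly between two bases has entropy 1 bit, so lower values correspond to higher consistency.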
IMPORTANCE
Our results indicate that most severe acute respiratory syndrome coronavirus 2 genomes sampled from patients had a mutation rate ≤1.07‰ and that genome-tail proteins (including the S protein) were the main sources of genetic polymorphism. Analysis of the virus-host interaction network of the genome-tail proteins showed that they shared some antiviral signaling pathways, especially the intracellular protein transport pathway.
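To give a sense of scale for the ≤1.07‰ figure, the sketch below assumes that the mutation rate is the number of substituted positions relative to a reference, divided by the number of compared positions, and expressed per thousand. That definition and the name `mutation_rate_per_mille` are assumptions for illustration; the text here does not state how the rate was computed.

```python
GENOME_LENGTH = 29_903  # nt, upper coordinate of the genome range quoted in the abstract


def mutation_rate_per_mille(reference, sample):
    """Substitutions per 1,000 compared positions between two aligned sequences.

    Assumption: only positions where both sequences carry an unambiguous
    base are compared, and T is treated as U.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    ref = reference.upper().replace("T", "U")
    smp = sample.upper().replace("T", "U")
    compared = 0
    mismatches = 0
    for r, s in zip(ref, smp):
        if r in "ACGU" and s in "ACGU":
            compared += 1
            if r != s:
                mismatches += 1
    return 1000 * mismatches / compared if compared else 0.0


def substitutions_to_per_mille(substitutions, length=GENOME_LENGTH):
    """Convert a substitution count over a genome of the given length to per mille."""
    return 1000 * substitutions / length
```

Under this assumed definition, a rate of about 1.07‰ over a 29,903-nt genome corresponds to roughly 32 substituted positions.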