The World Health Organization characterized COVID-19 as a pandemic in March 2020, the second pandemic of the 21st century. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a positive-stranded RNA betacoronavirus of the family Coronaviridae. Expanding virus populations, such as that of SARS-CoV-2, accumulate a number of narrowly shared polymorphisms that confound traditional clustering methods. In this context, approaches that reduce the complexity of the sequence space occupied by the SARS-CoV-2 population are necessary for robust clustering. Here, we propose the subdivision of the global SARS-CoV-2 population into sixteen well-defined subtypes by focusing on the widely shared polymorphisms in nonstructural cistrons (nsp3, nsp4, nsp6, nsp12, nsp13 and nsp14), structural genes (spike and nucleocapsid) and an accessory gene (ORF8).
Six virus subtypes were predominant in the population, but all sixteen showed amino acid replacements that might have phenotypic implications. We hypothesize that the virus subtypes detected in this study are records of the early stages of SARS-CoV-2 diversification that were randomly sampled to compose the virus populations around the world, a typical founder effect. The genetic structure determined for the SARS-CoV-2 population provides substantial guidelines for maximizing the effectiveness of trials testing candidate vaccines or drugs.
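The idea of clustering on widely shared polymorphisms only can be illustrated with a minimal sketch. It is not the pipeline used in this study: the frequency cutoff `min_freq`, the function names `widely_shared_sites` and `assign_subtypes`, and the toy aligned sequences are all hypothetical and chosen only for illustration. The sketch keeps alignment columns where a non-reference base reaches the cutoff, then groups genomes by their haplotype across those columns, so that narrowly shared (private) variants do not split the groups.

```python
from collections import Counter


def widely_shared_sites(aligned, reference, min_freq=0.2):
    """Return 0-based columns where some non-reference base reaches min_freq.

    `aligned` is a list of equal-length strings aligned to `reference`.
    The cutoff is an illustrative assumption, not a value from the study.
    """
    n = len(aligned)
    sites = []
    for pos, ref_base in enumerate(reference):
        counts = Counter(seq[pos] for seq in aligned)
        # Count only unambiguous bases that differ from the reference.
        alt_counts = [c for b, c in counts.items() if b != ref_base and b in "ACGT"]
        if alt_counts and max(alt_counts) / n >= min_freq:
            sites.append(pos)
    return sites


def assign_subtypes(aligned, sites):
    """Count genomes per haplotype across the selected columns."""
    return Counter("".join(seq[p] for p in sites) for seq in aligned)


# Toy data (hypothetical): two widely shared variants and two private ones.
reference = "ACGTACGT"
genomes = [
    "ACGTACGT",
    "ACGTACGT",
    "ATGTACGT",
    "ATGTACGT",
    "ATGTGCGT",
    "ATGTGCGT",
    "ACGTACGA",  # private variant at the last column
    "ACGTACTT",  # private variant at column 6
]

sites = widely_shared_sites(genomes, reference, min_freq=0.2)
subtypes = assign_subtypes(genomes, sites)
print(sites)
print(dict(subtypes))
```

In this toy case the two private variants fall below the cutoff, so the eight genomes collapse into three haplotype groups instead of five distinct sequences.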
Main
In December 2019, a local pneumonia outbreak of initially unknown etiology was detected in Wuhan (Hubei, China) and quickly determined to be caused by a novel coronavirus [1], named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [2]; the disease was named COVID-19 [3]. SARS-CoV-2 is classified in the family Coronaviridae, genus Betacoronavirus, which comprises enveloped, positive-stranded RNA viruses of vertebrates [2]. Two-thirds of the SARS-CoV genome is covered by ORF1ab, which encodes a large polypeptide that is cleaved into 16 nonstructural proteins (NSPs) involved in replication-transcription in vesicles formed from endoplasmic reticulum (ER)-derived membranes [4,5]. The last third of the virus genome encodes four essential structural proteins, spike (S), envelope (E), membrane (M) and nucleocapsid (N), and several accessory proteins that interfere with the host innate immune response [6].
Populations of RNA viruses evolve rapidly due to their large population sizes, short generation times, and high mutation rates, the latter being a consequence of the RNA-dependent RNA polymerase (RdRP), which lacks proofreading activity [7]. In fact, virus populations are composed of a broad spectrum of closely related genetic variants resembling one or more master sequences [8][9][10]. Mutation rates inferred for SARS-CoVs are considered moderate [11,12] owing to an independent proofreading activity [13]. However, the large SARS-CoV genomes (from 27 to 31 kb) [14] give them the ability to explore the sequence spa...